Abstract

The  problem  of  speech  enhancement  presents  many  obstacles  in  the  speech  processing  field. 
This  thesis  develops  several  speech  de-noising  systems  (SDS)  that  can  be  used  in  the  time,  Fourier, 
and  the  wavelet  domains.  We  present  two  different  thresholding  techniques,  the  soft  thresholding 
technique  (STT)  and  the  hard  thresholding  technique  (HTT).  The  application  of  these  thresholding 
techniques  to  noisy  speech  data  is  discussed.  The  combination  of  both  the  Fourier  and  wavelet 
domains  in  speech  de-noising  proves  to  yield  the  best  results  in  terms  of  speech  intelligibility. 
Informal  listening  tests  are  conducted  in  order  to  compare  the  effects  of  using  the  STT,  the  HTT, 
the  noisy  phase,  the  time  domain,  the  Fourier  domain,  and  the  wavelet  domain. 
NOISE  REDUCTION  FOR  SPEECH  ENHANCEMENT  USING
NON-LINEAR  WAVELET  PROCESSING 

I.  Introduction

1.1  Background 

In  recent  years,  many  speech  processing  scholars  have  developed  speech  systems  that  have 
some  degree  of  success  when  used  with  speech  data  acquired  under  near-ideal  conditions.  By  far 
the  majority  of  recognition  and  encoding  schemes  have  been  developed  and  tested  using  speech 
recorded on very sophisticated equipment in a quiet environment. As speech processing has moved
from ideal laboratory conditions to the field, it has become increasingly important to face the
problems imposed by the presence of ambient noise. Once in the real world, most speech
processing systems, especially speech recognition and speech encoding systems, fall well short of
their promises. Speech degraded by ambient noise retains most of its formant structure, which is
still detectable by the human listener; however, the human listener cannot listen to speech under
degraded conditions for a long time without suffering auditory fatigue (19). In order to reduce the
effects of ambient noise, many techniques for enhancement of noisy speech have been developed.

The main objective of speech enhancement is to attenuate the intensity of the noise,
while  preserving  the  overall  structure  (i.e.,  pitch,  formants,  etc.)  and  intelligibility  of  speech.  In 
particular,  the  military  environment  is  one  of  the  most  crucial  environments  where  speech  data 
is  vulnerable  to  ambient  noise,  especially  noise  due  to  the  engines  of  tanks,  military  vehicles, 
helicopters,  airplanes,  and  others. 
1.2  Problem  Statement


The  problem  considered  in  this  thesis  is  to  enhance  noisy  speech  data  and  still  preserve 
intelligibility.  In  order  to  accomplish  this  goal,  we  propose  to  develop  a  speech  processing  scheme 
using  both  wavelets  and  the  thresholding  techniques. 

The  United  States  military  is  carrying  on  intensive  research  in  order  to  develop  systems  that 
are  very  reliable  and  very  robust  in  enhancing  speech  data  degraded  by  ambient  noise.  One  of  the 
new  areas  of  this  research  is  the  use  of  wavelets  in  order  to  explore  their  unique  filtering  abilities 
with  noisy  speech  data.  In  the  last  decade,  the  theory  of  wavelets  has  grown  significantly,  and  has 
promised to change both signal and information processing. The major advantage of wavelets over
the classic signal processing tools (e.g., the Fourier transform) is their ability to decompose a
signal into orthogonal resolution levels. This property makes wavelets one of the best tools
to use with signals composed of many high-energy frequency peaks, such as speech.

In general, noise is a broad-band signal. The ability of wavelets to decompose a signal into
various frequency bands allows us to locate noise in certain frequency bands and eliminate it,
but at the expense of affecting the formant structure of the signal degraded by this noise. In
order to avoid distorting the underlying signal, we resort to thresholding techniques that are
based on the general statistics of the ambient noise. Hard thresholding is a technique that sets
to zero all data samples whose absolute value is below a fixed threshold. Soft thresholding, on
the other hand, sets to zero all data samples whose absolute value is below a fixed threshold,
and pulls all data above the threshold towards zero by the amount of the threshold. The use of
thresholding helps decrease the amount of noise while preserving most of the formant
structure of the underlying signal.
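
The two thresholding rules described above can be sketched in a few lines. This is only an illustrative sketch: the function names and the sample values are assumptions made for the example, not the code developed for this thesis (which was written in ANSI C), and the treatment of samples exactly equal to the threshold is a convention chosen here.

```python
def hard_threshold(samples, threshold):
    """Zero every sample whose magnitude is below the threshold; keep the rest unchanged."""
    return [x if abs(x) >= threshold else 0.0 for x in samples]


def soft_threshold(samples, threshold):
    """Zero small samples and pull the remaining ones toward zero by the threshold."""
    out = []
    for x in samples:
        if abs(x) <= threshold:
            out.append(0.0)            # below (or at) the threshold: eliminated
        elif x > 0:
            out.append(x - threshold)  # positive sample shrunk toward zero
        else:
            out.append(x + threshold)  # negative sample shrunk toward zero
    return out


# Hypothetical data, chosen only to show the difference between the two rules.
data = [-3.0, -0.5, 0.2, 1.5, 4.0]
print(hard_threshold(data, 1.0))
print(soft_threshold(data, 1.0))
```

Hard thresholding leaves the surviving samples untouched, whereas soft thresholding also reduces their magnitude, which avoids a jump at the threshold.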
1.3  Scope


This thesis is limited to the development of different speech de-noising systems to process
speech to which various amounts of white Gaussian noise have been added (signal-to-noise ratios
vary between -10 dB and 10 dB). These systems are based on the use of wavelets, Fourier, and
non-linear statistical processing of speech data from the TIMIT data base. Quantitative squared error
criteria and qualitative listening tests are performed.

No  attempt  to  automatically  determine  pitch,  silent,  voiced,  or  unvoiced  portions  is  made. 
These  are  assumed  to  be  known.  The  algorithm  developed  is  intended  to  be  one  subsystem  of  a 
pre-processor  used  to  remove  noise  from  noisy  speech  before  use  by  other  speech  processing  systems 
(e.g., speech identification, speech recognition, etc.) or by human listeners.

The mathematical background in wavelets, Fourier, and non-linear statistical methods
necessary to understand the de-noising systems developed in this thesis is presented.

1.4  Approach 

The  noisy  speech  signal  is  decomposed  into  voiced,  unvoiced,  and  silent  portions.  The  silent 
portions  are  used  to  estimate  the  variance  of  the  noise  which  is  assumed  to  be  white  Gaussian  noise. 
The  voiced  portions  are  subjected  to  the  thresholding  techniques.  Depending  on  the  method  used, 
we  may  process  speech  in  time,  frequency  (Fourier),  wavelet,  or  any  combination  of  these  three 
domains.  The  phase  of  the  noisy  voiced  speech  may  be  saved  before  processing  the  noisy  voiced 
speech.  On  the  other  hand,  both  the  unvoiced  and  silent  portions  are  multiplied  by  a  ratio  to  be 
discussed  later.  Before  processing  any  speech  segment,  each  portion  (i.e.,  voiced,  unvoiced,  and 
silent)  is  multiplied  by  a  window  function  to  be  defined  later. 
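
As a rough illustration of the first step of the approach, the noise variance can be estimated from samples known to belong to a silent portion. The sketch below is an assumption-laden example: the function name and the sample values are hypothetical, and it uses the usual unbiased sample variance, which may differ in detail from the estimator used in the thesis.

```python
def estimate_noise_variance(silent_samples):
    """Unbiased sample variance of samples taken from a silent (speech-free) portion."""
    n = len(silent_samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a variance")
    mean = sum(silent_samples) / n
    return sum((x - mean) ** 2 for x in silent_samples) / (n - 1)


# Hypothetical silent-portion samples (noise only).
silence = [1.0, -1.0, 1.0, -1.0]
print(estimate_noise_variance(silence))
```

Since the noise is assumed to be white and Gaussian, this single variance characterizes it completely once its mean is taken to be zero.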
1.5  Objectives


The  objectives  of  this  research  are  to  answer  the  following  four  questions: 

1.  Can we enhance noisy speech by applying both wavelets and the thresholding techniques?

2.  Under what conditions does the application of wavelets and the thresholding techniques to noisy
speech data yield intelligible results?

3.  Can we use both wavelets and Fourier analysis to enhance noisy speech?

4.  How  do  wavelets  and  the  thresholding  techniques  affect  the  quality  of  the  de-noised  speech? 

1.6  Equipment  and  Materials 

The  following  tools  were  crucial  to  this  research: 

1.  SPARC 2 workstations are used for coding and testing purposes.

2.  ANSI C is the programming language for all codes developed for this research.

3.  Mathematica is used for developing graphs and bar charts.

4.  ESPS-4  (Entropic  Signal  Processing  System)  is  used  for  both  spectrograms  and  listening  tests. 

5.  is  used  to  typeset  this  document. 

6.  TIMIT  data  base. 

1.7  Organization

In chapter two, we present past and current research in the area of enhancement of noisy
speech. In chapter three, we discuss the necessary wavelet, Fourier, and thresholding theories.
Based on the results and theories of chapter three, we present, in chapter four, eight de-noising
systems. In chapter five, we test the de-noising systems of chapter four with actual noisy speech
data and analyze the results in terms of both error and spectrogram analysis as well as informal
listening tests. Finally, in chapter six, we present the thesis conclusions and recommendations.
II.  Literature  Review


2.1  Introduction 

This  chapter  focuses  on  evaluating  past  techniques  and  research  in  the  area  of  enhancing  noisy 
speech.  These  techniques  cover  several  methods  used  to  solve  the  problem  of  eliminating  some  of  the 
noise from a speech signal. Because of the similarities between the different techniques, we present
the methods in chronological order to illustrate some of the problems encountered in
the field of speech processing.

2.2  Recent  Developments  In  Enhancing  Noisy  Speech 

Enhancing  noisy  speech  presents  three  major  problems: 

a.  detecting  the  presence  of  noise. 

b.  estimating  the  noise  power. 

c.  differentiating  between  speech  and  non-speech  signals. 

The  quality  and  intelligibility  of  the  resulting  speech  signal  depend  on  the  method  used  and  on  the 
assumptions  made  to  locate  and  estimate  the  noise. 

2.2.1  Suppression  of  Acoustic  Noise  In  Speech  Using  Spectral  Subtraction.  In  1979,  Steven 
Boll  presented  a  simple  technique  (Spectral  Subtraction)  to  enhance  speech  degraded  by  additive 
white  noise  (3).  His  technique  (among  the  best  techniques  during  the  early  eighties)  is  well  known 
in the speech processing field. His algorithm measures the signal present during non-speech activity
and uses it as an estimate of the noise. The spectrum of the estimated noise is then subtracted from
that of the noisy speech. If we assume that speech is a stationary signal and that the noise is
additive and uncorrelated, then we can represent the noisy speech signal as

\[ y(t) = s(t) + n(t), \tag{2.1} \]

where $s$ and $n$ are the speech and noise signals, respectively, and both are real. Taking the
Fourier transform (see equation 3.137) of equation 2.1, we obtain

\[ y(\omega) = s(\omega) + n(\omega). \tag{2.2} \]

The power of the above spectra is given by

\[ |y(\omega)|^2 = |s(\omega)|^2 + |n(\omega)|^2 + 2\left[ \mathrm{Re}(s(\omega))\,\mathrm{Re}(n(\omega)) + \mathrm{Im}(s(\omega))\,\mathrm{Im}(n(\omega)) \right]. \tag{2.3} \]

Since the noise and signal random variables are assumed to be uncorrelated, the expected value
(see equation 3.7) of the cross-product terms of equation 2.3 vanishes, and the expected power
spectra can then be related by (19)

\[ |y(\omega)|^2 = |s_e(\omega)|^2 + |n_e(\omega)|^2, \tag{2.4} \]

where $|n_e(\omega)|^2$ and $|s_e(\omega)|^2$ are estimates of the noise and speech powers, respectively.

If we can obtain a satisfactory estimate of $|n(\omega)|^2$, we can recover $|s(\omega)|^2$ by using equation
2.4, since we know the power $|y(\omega)|^2$. In practice, the noise is estimated by observing the signal
during non-speech activity (19). The result is

\[ |s_e(\omega)|^2 = |y(\omega)|^2 - |n_e(\omega)|^2. \tag{2.5} \]

Using the results from equation 2.5, Boll subtracted the magnitude spectra themselves instead of
the power spectra, and since the magnitude is a positive quantity, any negative output is set to
zero (19). The above process can be viewed as a filtering operation defined by

\[ |s_e(\omega)| = |y(\omega)| - |n_e(\omega)| = |y(\omega)| \left( 1 - \frac{|n_e(\omega)|}{|y(\omega)|} \right) = |y(\omega)|\, h(\omega), \tag{2.6} \]

where the filter $h$ is given by

\[ h(\omega) = 1 - \frac{|n_e(\omega)|}{|y(\omega)|}, \tag{2.7} \]


where $0 < |h(\omega)| < 1$. Since negative amplitudes are not allowed, Boll used the filter $h$ to define
a half-wave rectification filter $h_R$ as

\[ h_R(\omega) = \frac{h(\omega) + |h(\omega)|}{2}. \tag{2.8} \]

In order to recover the estimated speech signal $s_e(t)$, we need to take the inverse Fourier
transform (see figure 2.1). However, we need the phase of $s_e(\omega)$. Boll approximated this phase by
the phase of the known noisy signal $y(\omega)$. The recovered signal can then be obtained using the
following equation

\[ s_e(\omega) = |s_e(\omega)|\, e^{j\theta}, \tag{2.9} \]

where $\theta$ is the phase of $y(\omega)$.

In order to account for the case where the speech is absent, Boll modified his algorithm to
allow a second pass to further reduce the residual noise left after the application of the spectral
subtraction. The residual noise can be further attenuated without distorting the speech signal (3).
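
The magnitude subtraction, the half-wave rectification, and the reuse of the noisy phase can be sketched for a single frame as follows. This is a minimal sketch under stated assumptions, not Boll's implementation: it uses a naive O(N^2) discrete Fourier transform so that it needs only the standard library, it omits windowing, frame overlap, and the second residual-noise pass, and the function names and the noise magnitude estimate passed in are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; only the real part is returned."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def spectral_subtraction(noisy_frame, noise_magnitude):
    """Subtract a noise magnitude estimate from one frame, keeping the noisy phase."""
    enhanced = []
    for y_k, n_k in zip(dft(noisy_frame), noise_magnitude):
        magnitude = max(abs(y_k) - n_k, 0.0)   # negative results are set to zero
        phase = cmath.phase(y_k)               # phase of the noisy spectrum
        enhanced.append(cmath.rect(magnitude, phase))
    return idft(enhanced)


# Hypothetical frame and a flat (white) noise magnitude estimate.
frame = [1.0, 0.0, -1.0, 0.0]
print(spectral_subtraction(frame, [0.5] * len(frame)))
```

The noise magnitude estimate should be symmetric in frequency, as the magnitude spectrum of a real signal is, so that the enhanced frame is real.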

Figure 2.1  Spectral Subtraction By Steven Boll

2.2.2  Speech Enhancement By Fourier-Bessel Coefficients Of Speech And Noise.  In 1990,
F.S. Gurgen and C.S. Chen introduced an enhancement technique for noisy speech based on the
Fourier-Bessel (FB) expansion of the speech and noise (11). The method is based on the subtraction
of the FB coefficients of the estimated noise from the coefficients of the noisy speech. The difference
between the two sets of coefficients is then used to synthesize the enhanced speech.

2.2.2.1  Spectral Properties of Fourier-Bessel Coefficients.  The solution of the wave
equation inside cylindrical structures (tubes) involves Bessel functions of the first kind (22). In
their method, Gurgen and Chen model the vocal tract as a cylindrical tube. The speech signal is
represented using the first-order Bessel functions of the first kind, $J_1(t)$, as the basis functions for
expansion. This representation is called a Fourier-Bessel (FB) expansion.

The FB expansion of the speech signal is achieved by using $J_1(\alpha_m t)$ as basis functions of
representation, where $\alpha_m = t_m / A$, $t_m$ is the $m$-th root of $J_1(t) = 0$, and $A$ is the duration of the time
frame under analysis. The decomposition describes a speech signal as a linear combination of the
orthogonal basis functions

\[ s(t) = \sum_{m=1}^{\infty} C_m J_1(\alpha_m t). \tag{2.10} \]
The set $\{J_1(\alpha_m t)\}$ is orthogonal with respect to the weighting function $t$, and the $C_m$ coefficients
in equation 2.10 are given by

\[ C_m = \frac{2}{A^2 \left( J_2(t_m) \right)^2} \int_0^A t\, s(t)\, J_1(\alpha_m t)\, dt. \tag{2.11} \]

By taking the Fourier transform of $J_1(\alpha_m t)$, Gurgen and Chen showed that the FB series behaves
like a low-pass filter. By using the magnitude and the phase spectrum, it is possible to calculate
the maximum frequency achieved with the number of the roots of $J_1(\alpha_m t)$ as (11)

Z™.  =  (2.12) 

2.2.2.2  Noise suppression using FB expansion.  Just as in Boll’s method, the speech
signal $s(t)$ is assumed to be degraded by uncorrelated additive noise $n(t)$, where the noisy speech
signal $y(t)$ is given by

\[ y(t) = s(t) + n(t). \tag{2.13} \]

Taking the FB expansion of the above signal, we get

\[ y_m = s_m + n_m, \tag{2.14} \]

where $m = 1, 2, 3, \ldots$.

Experimentally, Gurgen and Chen showed that the FB coefficient representation, with up to
150 coefficients and a 10 ms analysis frame, introduces a low-pass filtering effect on the speech signal
by attenuating its high-frequency region. Therefore, the noise, which is assumed to contain most of
the high-frequency components, can be suppressed by using an appropriate number of coefficients
in the synthesis of the signal (11).
Since $y_m$ is known (raw data), if we can obtain a satisfactory estimate of the noise level and calculate
its FB expansion, we can get an estimate of the enhanced speech signal as

\[ s_m = y_m - n_m. \tag{2.15} \]

The  estimation  of  the  noise  is  based  on  two  different  techniques,  the  single-microphone  case 
and  the  two-microphone  case.  In  the  single-microphone  case,  the  noise  estimate  is  accomplished 
by  detecting  the  speech/non-speech  intervals  using  energy  thresholds  to  locate  the  silence  intervals 
where  the  energy  of  the  noise  can  be  estimated.  In  the  two-microphone  case,  a  reference  microphone 
path  is  used  to  estimate  the  noise  and  calculate  its  FB  coefficients.  A  primary  microphone  path  is 
used  to  calculate  the  FB  coefficients  of  the  noisy  speech.  The  difference  between  these  two  paths 
is  used  to  estimate  the  FB  coefficients  of  the  enhanced  signal  (11). 

2.2.3  Adaptive Noise Reduction Using Discrimination Functions.  Most speech enhancement
techniques (e.g., spectral subtraction by S. Boll) are based on using speech detectors to locate
the non-speech activities in a speech signal and use that information to estimate noise. The quality
of the results depends heavily on the quality of the speech detectors used in the analysis. The
Discrimination Function Minimization (DFM) method does not use a speech detector and does not
assume stationarity of the noise over an entire speech period. The purpose of the DFM is to define
a  function  that  differentiates  between  clean  and  noisy  speech  signals  in  order  to  reduce  the  noise 
in  the  noisy  speech  signal  (10).  Based  on  essential  features  of  speech  and  ambient  noise,  the  DFM 
uses  a  single-microphone  adaptive  filtering  approach  and  minimizes  a  mean  square  error  function. 

2.2.3.1  Discrimination Function Minimization (DFM).  The DFM technique involves
two steps:

1.  Definition of a Discrimination Function $J(x)$

The discrimination function $J(x)$ is defined for a vector $x$ such that

\[ J(x)|_{x \in S \cap N} < J(x)|_{x \in N}, \tag{2.16} \]

where 

a.  $S = \{s\}$, set of segments of clean speech.

b.  $N = \{n\}$, set of segments of noise.

c.  $S \cap N = \{y = s + n\}$, set of noisy speech segments.

The  above  equation  states  that  the  value  of  J  for  pure  noise  signals  is  greater  than  that  of  speech 
and  noise  signals. 

2.  Filtering  or  suppression  of  the  noise  based  on  setting  the  coefficients  of  the  filter  h 
such  that  J  is  minimized  (see  figure  2.2). 


Figure 2.2  Block Diagram Of The DFM Noise Reduction

2.2.3.2  Example Of A Discrimination Function.  Since the rates of change
of the noise parameters (e.g., autocorrelation) are lower than those of speech signals, the authors
concluded that it is possible to derive some discrimination functions directly from the dynamics
of the speech sample variance. Experimentally, the authors found that the rate of change,
$\Delta v(i)$, of the speech frame variance $\sigma(i)$ and the duration of the so-called stationary periods, where
$\Delta v(i) < \Delta v_{Thr}$, are two discriminating features that can be used to filter the noise out of a noisy
speech signal (10).

Let $s$ be a clean speech segment degraded by uncorrelated additive noise $n$. The noisy speech
signal $y$ is given by

\[ y = s + n, \tag{2.17} \]

and define $N_R$ to be a subset of the noise set $N$ such that the length of each vector in $N_R$ is longer
than $T_{max}$, the maximum length of a stationary period in a speech signal (i.e., $T_{max}$ = 200 ms). The
discrimination function is then defined as

\[ J_R(x)|_{x \in S \cap N} < J_R(x)|_{x \in N_R}, \tag{2.18} \]

where $N_R \subset N$. Let $J_R(x)$ be a discrimination function defined over a data frame of length $N$.
The sample variances are calculated for each sub-frame of length $L$. These sub-frames are
non-overlapping and, therefore, we have $p = N/L$ sub-frames.

The sample variance for each sub-frame is defined as

\[ \sigma(i) = \frac{1}{L} \sum_{k=0}^{L-1} s^2(iL - k), \tag{2.19} \]

where $i = 1, 2, \ldots, p$ and $s(j)$ is the filtered signal at time $j$, which is calculated using the input
vector $y$ and a transversal filter $h$ of order $M$ such that

\[ s(j) = \sum_{n=0}^{M-1} h(n)\, y(j - n), \tag{2.20} \]
for $j = 0, 1, \ldots, N-1$. Now define the absolute value of the relative change of the variance as

\[ \Delta v(i) = \left| \frac{\sigma(i) - \sigma(i-1)}{\sigma(i-1)} \right| \tag{2.21} \]

and an exponential weighting factor as

\[ w(i) = r^i, \tag{2.22} \]

where $0 < r < 1$, and $i = 1, 2, \ldots, p$.
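
The quantities that feed the discrimination functions can be sketched as follows. This is an illustrative sketch only: it assumes a zero-mean filtered signal, a sample variance normalized by the sub-frame length, and zero-based sub-frames, and the function names and input values are hypothetical rather than taken from the cited work (10).

```python
def subframe_variances(filtered, sub_len):
    """Mean squared value of each non-overlapping sub-frame (zero mean assumed)."""
    p = len(filtered) // sub_len          # number of complete sub-frames
    return [sum(s * s for s in filtered[i * sub_len:(i + 1) * sub_len]) / sub_len
            for i in range(p)]


def relative_changes(variances):
    """Absolute relative change of the variance between consecutive sub-frames.

    Assumes every variance except possibly the last is non-zero.
    """
    return [abs(cur - prev) / prev for prev, cur in zip(variances, variances[1:])]


def exponential_weights(r, p):
    """Weights r**i for i = 1, ..., p with 0 < r < 1."""
    return [r ** i for i in range(1, p + 1)]


# Hypothetical filtered signal split into sub-frames of length 2.
signal = [1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
variances = subframe_variances(signal, 2)
print(variances)
print(relative_changes(variances))
print(exponential_weights(0.5, 3))
```

A small relative change over many consecutive sub-frames indicates a long stationary period, which is the behaviour expected of noise rather than of speech.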

Using  the  above  definitions,  we  can  define  two  discrimination  functions: 

1.  A  first  discrimination  function  that  maximizes  the  relative  changes  of  variance 

defined  as 


Jmix)  = 


w{k)Av(p  —  k) 


(2.23) 


2.  A  second  discrimination  function  that  minimizes  the  durations  of  the  stationary 
periods  under  analysis  defined  as 

p-i 

J’r2(x)  =  XI  (2-24) 

fc=o 


where $e(i)$ is defined as

\[ e(i) = \begin{cases} T - T_{max} & \text{if } T > T_{max} \text{ and } \Delta v(i) < \Delta v_{Thr}, \\ 0 & \text{otherwise,} \end{cases} \tag{2.25} \]
where, for $i = 1, 2, \ldots, p$, the value of $e(i)$ is the excess time beyond the period $T_{max}$.
The entire discrimination function can be defined as

\[ J_R(x) = c_1 J_{R1}(x) + c_2 J_{R2}(x), \tag{2.26} \]

where $c_1$ and $c_2$ are normalizing factors. The minimization of $J_R$ in order to find the coefficients
of the filter $h$ has two consequences: $J_{R1}$ maximizes the relative changes of the variances and,
according to equation 2.25, $J_{R2}$ minimizes the durations of the stationary periods (10).

The  accuracy  of  the  DFM  method  depends  heavily  on  the  validity  of  the  discrimination 
function.  Besides  the  fact  that  the  DFM  does  not  require  a  speech  activity  detector,  the  main 
advantage  of  the  DFM  is  that  the  filter  h  adapts  to  the  changes  of  the  noise  patterns  throughout 
the  speech  signal. 

2.2.4  Other Speech Enhancement Techniques.  Many speech processing researchers model
speech  as  a  sum  of  sinusoidal  periodic  functions.  Kobatake,  Karou,  and  Sheng  approached  the 
speech  enhancement  problem  by  means  of  the  maximum  likelihood  estimation  (MLE).  The  authors 
segmented  the  speech  signal  into  frames  and  sub-frames  and  then,  by  maximizing  an  a  posteriori 
probability  density  function,  they  estimated  the  Fourier  coefficients  of  the  voiced  portions  at  a 
specific  frame  (15). 

In  1989,  Nadeem  A.  Bashir,  a  graduate  student  at  the  Air  Force  Institute  Of  Technology 
(AFIT),  developed  a  system  in  order  to  enhance  the  quality  of  mutilated  speech.  His  technique 
analyzes the damaged speech in the Fourier domain and then, based on known properties of normal
speech,  such  as  periodicity  of  voiced  speech,  a  computer  program  generates  a  set  of  sinusoids  whose 
amplitudes  and  phases  are  derived  directly  from  the  speech  signal  itself.  These  sinusoids  are  used 
to  reconstruct  a  cleaner  and  clearer  version  of  the  mutilated  speech  (13). 


III.  Stein’s  Criteria,  Wavelet,  And  Fourier  Theory 


3.1  Introduction 

In this chapter, we present three main topics: Stein’s criteria, wavelets, and Fourier analysis.
Stein’s criteria define both the necessary conditions to estimate the mean of an independent normal
random vector and a method for estimating the risk associated with the mean estimation
technique.  Next,  we  present  the  necessary  wavelet  theory  and  how  it  can  be  related  to  Stein’s 
criteria.  In  fact,  we  will  prove  that  the  wavelet  coefficients  of  an  independent  normal  random 
vector,  are  themselves  independent  and  normal.  This  property  of  the  wavelet  coefficients  makes 
them  candidates  to  use  with  Stein’s  criteria.  Finally,  we  present  the  Fourier  transform  and  some  of 
its  properties  and  we  will  prove  that  the  Fourier  coefficients  (with  some  restrictions  to  be  discussed 
later)  can  be  used  with  Stein’s  criteria. 

Using the theory of Stein, we present two different thresholding techniques, the hard thresholding
technique (HTT) and the soft thresholding technique (STT). These thresholding techniques
will  be  used  in  our  experiments  dealing  with  de-noising  speech.  Throughout  this  chapter,  we  will 
assume  that  all  random  vectors  are  independent,  normal,  and  have  the  same  variance. 

3.2  Stein’s Unbiased Estimate Of Risk (SURE)

Given a normal random vector, $X = (X_0, X_1, X_2, \ldots, X_{N-1})$, whose elements, $X_i$, are
independent normal random variables with arbitrary means $\mu_i$ and the same variance $\sigma^2$ such that for
$i = 0, 1, 2, \ldots, N-1$

\[ X_i \sim N(\mu_i, \sigma^2), \tag{3.1} \]

Charles Stein, a statistician at Stanford University, introduced a simple equation (SURE) to
estimate the error associated with the estimation $\hat{\mu} = (\hat{\mu}_0, \hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_{N-1})$ of the true mean,
$\mu = (\mu_0, \mu_1, \ldots, \mu_{N-1})$, of the normal random vector $X$ by

\[ \hat{\mu} = X + g(X), \tag{3.2} \]

where $g : R^N \to R^N$ is an almost differentiable function to be defined later (20).

Stein’s  theory  can  be  used  with  any  normal  random  vector  with  independent  random  variables 
whose variances are identical. The next sections provide a detailed derivation of Stein’s error
equation  which  we  will  use  with  both  wavelets  and  Fourier.  Stein  developed  his  criteria  by  first 
deriving  the  basic  equations  for  a  standard  normal  random  variable  (zero  mean  and  variance  of 
one)  and  then,  he  extended  the  results  to  the  case  of  several  arbitrary  normal  random  variables 
with  the  same  variance.  It  is  important  to  understand  that  all  the  normal  random  variables  are 
assumed  to  be  independent  and  have  the  same  variance  with  an  arbitrary  mean. 

3.2.1  Standard Normal Distribution: $X \sim N(0, 1)$.  Let $X$ be a real random variable with
a standard normal distribution

\[ \phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \tag{3.3} \]

The derivative of the above probability density function (pdf) is

\[ \phi'(x) = -x\,\phi(x), \tag{3.4} \]

and let $g$ be an indefinite integral of the Lebesgue measurable function $g'$ such that

\[ g : R \to R, \tag{3.5} \]

and

\[ E\{|g'(X)|\} < \infty, \tag{3.6} \]

where $E$ is the expectation operator defined by

\[ E[X] = \int_{-\infty}^{\infty} x\,\phi(x)\,dx, \tag{3.7} \]

and $g'$ is the derivative of the function $g$. We shall show that

\[ E[g'(X)] = E[X g(X)]. \tag{3.8} \]


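First of all, we have the following identities concerning the standard normal distribution

\[ \phi(x) = \int_{-\infty}^{x} \phi'(z)\,dz = \int_{-\infty}^{x} -z\,\phi(z)\,dz. \tag{3.9} \]

Since $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ and $\phi'(z) = -z\,\phi(z)$, we have the following relations

\[ \phi'(-z) = z\,\phi(-z) = z\,\phi(z) = -\phi'(z), \tag{3.10} \]

and we can then write

\[ \begin{aligned}
\phi(x) = \phi(-x) &= \int_{-\infty}^{-x} -z\,\phi(z)\,dz \\
&= -\int_{-\infty}^{-x} z\,\phi(z)\,dz \\
&= -\int_{\infty}^{x} (-z)\,\phi(-z)\,d(-z) \\
&= \int_{x}^{\infty} z\,\phi(z)\,dz.
\end{aligned} \tag{3.11} \]

Using the above equalities, we get

\[ \begin{aligned}
E[g'(X)] &= \int_{-\infty}^{\infty} g'(x)\,\phi(x)\,dx \\
&= \int_{-\infty}^{0} g'(x)\,\phi(x)\,dx + \int_{0}^{\infty} g'(x)\,\phi(x)\,dx \\
&= \int_{-\infty}^{0} g'(x) \int_{-\infty}^{x} -z\,\phi(z)\,dz\,dx + \int_{0}^{\infty} g'(x) \int_{x}^{\infty} z\,\phi(z)\,dz\,dx.
\end{aligned} \tag{3.12} \]

Using Fubini’s theorem (2), we can switch the order of integration and get

\[ \begin{aligned}
E[g'(X)] &= -\int_{-\infty}^{0} z\,\phi(z) \int_{z}^{0} g'(x)\,dx\,dz + \int_{0}^{\infty} z\,\phi(z) \int_{0}^{z} g'(x)\,dx\,dz \\
&= \int_{-\infty}^{0} z\,\phi(z) \int_{0}^{z} g'(x)\,dx\,dz + \int_{0}^{\infty} z\,\phi(z) \int_{0}^{z} g'(x)\,dx\,dz \\
&= \int_{-\infty}^{\infty} z\,\phi(z) \int_{0}^{z} g'(x)\,dx\,dz \\
&= \int_{-\infty}^{\infty} z\,\phi(z)\,\bigl(g(z) - g(0)\bigr)\,dz \\
&= \int_{-\infty}^{\infty} z\,\phi(z)\,g(z)\,dz - \int_{-\infty}^{\infty} z\,\phi(z)\,g(0)\,dz \\
&= \int_{-\infty}^{\infty} z\,\phi(z)\,g(z)\,dz \\
&= \int_{-\infty}^{\infty} x\,\phi(x)\,g(x)\,dx \\
&= E[X g(X)].
\end{aligned} \tag{3.13} \]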
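
The identity $E[g'(X)] = E[X g(X)]$ can be checked numerically for a particular choice of $g$. The sketch below is only a sanity check under stated assumptions: the test function $g(x) = x^3$, the integration range, and the step count are arbitrary choices made for the example, and the trapezoidal rule stands in for the exact integrals.

```python
import math


def phi(x):
    """Standard normal probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def expectation(f, lo=-10.0, hi=10.0, steps=20000):
    """Approximate E[f(X)] for X ~ N(0, 1) with the trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) * phi(lo) + f(hi) * phi(hi))
    for k in range(1, steps):
        x = lo + k * h
        total += f(x) * phi(x)
    return total * h


def g(x):
    return x ** 3            # example function, chosen only for illustration


def g_prime(x):
    return 3.0 * x ** 2      # derivative of the example function


lhs = expectation(g_prime)                 # E[g'(X)]
rhs = expectation(lambda x: x * g(x))      # E[X g(X)]
print(lhs, rhs)
```

For this choice of $g$ both sides equal $3$, since $E[3X^2] = 3$ and $E[X^4] = 3$ for a standard normal variable.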
3.2.2  Arbitrary Normal Distribution: $Y \sim N(\mu, \sigma^2)$.  Using the results of the last section,
we will extend equation 3.13 to the case of an arbitrary normal random variable. The results of
this section will be used in the general case of a normal random vector whose components are
independent normal variables with the same variance and arbitrary mean.

Let $Y$ be a real random variable with an arbitrary normal distribution. Since $Y \sim N(\mu, \sigma^2)$,
the random variable $X = (Y - \mu)/\sigma$ has a standard normal distribution (i.e., $X \sim N(0, 1)$). Define
$h : R \to R$ such that

\[ h(y) = g\!\left( \frac{y - \mu}{\sigma} \right), \tag{3.14} \]

where $g$ is defined by equation 3.5. We shall derive a formula for $E[h'(Y)]$.

\[ \begin{aligned}
E[h'(Y)] &= \frac{1}{\sigma}\, E[g'(X)] \\
&= \frac{1}{\sigma}\, E[X g(X)] \\
&= \frac{1}{\sigma}\, E\!\left[ \frac{Y - \mu}{\sigma}\, h(Y) \right] \\
&= \frac{1}{\sigma^2}\, E[(Y - \mu)\, h(Y)].
\end{aligned} \tag{3.15} \]
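
The scaled identity $E[h'(Y)] = E[(Y - \mu)\,h(Y)]/\sigma^2$ can also be checked numerically. The sketch below is a hedged illustration: the choice $h(y) = \sin y$, the values of $\mu$ and $\sigma$, and the integration settings are assumptions made for the example only.

```python
import math


def normal_expectation(f, mu, sigma, steps=20000):
    """Approximate E[f(Y)] for Y ~ N(mu, sigma^2) with the trapezoidal rule."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    h = (hi - lo) / steps

    def weighted(y):
        z = (y - mu) / sigma
        return f(y) * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    total = 0.5 * (weighted(lo) + weighted(hi))
    total += sum(weighted(lo + k * h) for k in range(1, steps))
    return total * h


mu, sigma = 1.5, 2.0                       # hypothetical mean and standard deviation

# With h(y) = sin(y), the derivative is h'(y) = cos(y).
left = normal_expectation(math.cos, mu, sigma)
right = normal_expectation(lambda y: (y - mu) * math.sin(y), mu, sigma) / sigma ** 2
print(left, right)
```

For this test function both sides equal $\cos(\mu)\,e^{-\sigma^2/2}$, which gives an independent closed-form value to compare against.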


3.2.3  Generalized Formulas For A Multivariate Normal Distribution.  The formulas we
derived for the single normal random variables can be generalized to the case of a normal random
vector in which each element is an independent normal random variable with the same variance $\sigma^2$.
3.2.3.1  Multidimensional Definitions And Notations.  Let $X = (X_0, X_1, X_2, \ldots, X_{N-1})$
be a normal random vector in which each element $X_i$ is an independent normal random variable
such that for $i = 0, 1, 2, \ldots, N-1$

\[ X_i \sim N(\mu_i, \sigma^2). \]

The mean of the vector $X$ is defined as

\[ \mu = (\mu_0, \mu_1, \ldots, \mu_{N-1}). \tag{3.16} \]

The energy of the normal random vector is defined as

\[ \|X\|^2 = \sum_{i=0}^{N-1} X_i^2. \tag{3.17} \]

A function $h : R^N \to R$ is called almost differentiable if there exists a function $\nabla h : R^N \to R^N$
such that, for all $z \in R^N$,

\[ h(x + z) - h(x) = \int_0^1 z \cdot \nabla h(x + tz)\,dt, \tag{3.18} \]

for almost all $x \in R^N$. A function $g : R^N \to R^N$ is called almost differentiable if all its coordinates
are. The symbol $\nabla$ is the vector differential operator of first partial derivatives with $i$-th coordinate

\[ \nabla_i = \frac{\partial}{\partial x_i}, \]

so that

\[ \nabla_i h(x) = \frac{\partial h(x)}{\partial x_i}, \tag{3.19} \]

\[ \nabla h(x) = \left( \frac{\partial h(x)}{\partial x_0}, \frac{\partial h(x)}{\partial x_1}, \ldots, \frac{\partial h(x)}{\partial x_{N-1}} \right). \tag{3.20} \]
3.2.3.2  Basic Formulas For An Arbitrary Normal Multidimensional Random Variable.
Let $X$ be the multidimensional normal random variable defined in the previous section and
$h : R^N \to R$ an almost differentiable function such that


00, 


(3.21) 


Since each component $X_i$ (for $i = 0, 1, 2, \ldots, N-1$) is an independent normal random variable and
$X_i \sim N(\mu_i, \sigma^2)$, we can write


(3.23) 


3.2.4  Closed Form Of Stein’s Error Function.  Given a multidimensional normal vector
$X$, composed of independent normal random variables $X_i \sim N(\mu_i, \sigma^2)$ for $i = 0, 1, 2, \ldots, N-1$,
Stein defined an estimate $\hat{\mu} = (\hat{\mu}_0, \hat{\mu}_1, \ldots, \hat{\mu}_{N-1})$ of the true mean $\mu = (\mu_0, \mu_1, \ldots, \mu_{N-1})$ as follows

\[ \hat{\mu} = X + g(X), \tag{3.24} \]

where $g : R^N \to R^N$ is an almost differentiable function with coordinates $g(X) = (g_0(X), g_1(X), \ldots, g_{N-1}(X))$
such that

\[ g_i : R^N \to R, \]

and


where the subscript $\mu$ indicates the dependence of the expectation operator on the mean.

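For each normal random variable $X_i$, Charles Stein defined an unbiased estimate of the risk
(SURE) associated with estimating the true mean $\mu_i$ of the single independent normal random
variable $X_i$ as the expected squared error between the estimate $\hat{\mu}_i$ and the true mean $\mu_i$ as follows

\[
\begin{aligned}
E_\mu\left[ (\hat{\mu}_i - \mu_i)^2 \right]
&= E_\mu\left[ (X_i - \mu_i)^2 + 2 g_i(\vec{X})(X_i - \mu_i) + g_i^2(\vec{X}) \right] \\
&= E_\mu\left[ (X_i - \mu_i)^2 \right] + E_\mu\left[ g_i^2(\vec{X}) \right] + 2 E_\mu\left[ g_i(\vec{X})(X_i - \mu_i) \right].
\end{aligned}
\tag{3.25}
\]

Since

\[ E_\mu\left[ (X_i - \mu_i)\, g_i(\vec{X}) \right] = \sigma^2\, E_\mu\left[ \frac{\partial g_i(\vec{X})}{\partial X_i} \right], \tag{3.26} \]

equation 3.25 becomes

\[ E_\mu\left[ (\hat{\mu}_i - \mu_i)^2 \right] = \sigma^2 + E_\mu\left[ g_i^2(\vec{X}) \right] + 2\sigma^2 E_\mu\left[ \frac{\partial g_i(\vec{X})}{\partial X_i} \right]. \tag{3.27} \]

Using the above equation for a single random variable, Charles Stein defined an unbiased
estimate of the risk associated with estimating the mean $\vec{\mu}$ of the vector $\vec{X}$ as follows

\[
\begin{aligned}
E_\mu\left[ \|\hat{\vec{\mu}} - \vec{\mu}\|^2 \right]
&= \sum_{i=0}^{N-1} E_\mu\left[ (\hat{\mu}_i - \mu_i)^2 \right] \\
&= N\sigma^2 + E_\mu\left[ \|g(\vec{X})\|^2 \right] + 2\sigma^2 E_\mu\left[ \nabla \cdot g(\vec{X}) \right].
\end{aligned}
\tag{3.28}
\]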

Ideally, we want to minimize the risk defined by equation 3.28 in order to get a more accurate
estimate of the mean. Since this equation depends on the choice of the function $g$, many different
choices that satisfy the differentiability conditions stated above are available. Since the basic
estimation technique is based on adding a value to each element of the random vector $\vec{X}$, the next
two sections introduce two different choices of the function $g$. These choices have many practical
applications and can be used to de-noise signals degraded by additive white Gaussian noise. In
particular, Stein's theory implies that for white Gaussian noise with zero mean and a variance
of $\sigma^2$, the mean estimate obtained with Stein's criterion is, theoretically, zero. In other words, when we input
a zero-mean white Gaussian noise signal to a Stein-based mean estimator, we expect the output signal
to be zero. This observation can be used to de-noise signals corrupted by additive white Gaussian
noise with zero mean and a variance of $\sigma^2$.

3.3  Soft  Thresholding  Technique 

Let $\vec{X}$ be a multidimensional normal random vector whose elements are independent normal
random variables with the same variance $\sigma^2$ and let its mean be the vector $\vec{\mu} = (\mu_0, \mu_1, \ldots, \mu_{N-1})$.
Define an estimate of the mean $\vec{\mu}$ by $\hat{\vec{\mu}} = (\hat{\mu}_0, \hat{\mu}_1, \ldots, \hat{\mu}_{N-1})$ such that (5) (6) (7) (8) (9)

\[ \hat{\vec{\mu}} = \vec{X} + g(\vec{X}), \]

where $g(\vec{X}) = (g_0(\vec{X}), g_1(\vec{X}), \ldots, g_{N-1}(\vec{X}))$ is as defined in equation 3.24.

Figure 3.1 Soft thresholding technique (STT).


The Soft Thresholding Technique (STT) uses a threshold ($t > 0$) to estimate the true mean,
$\mu_i$, of each normal random variable, $X_i$, by the estimate $\hat{\mu}_i^s$, defined by (see figure 3.1)

\[ \hat{\mu}_i^s = X_i + g_i^s(\vec{X}), \]

where for each $i = 0, 1, 2, \ldots, N-1$

\[ g_i^s(\vec{X}) = \begin{cases} -t\, \mathrm{sgn}(X_i), & |X_i| > t \\ -X_i, & |X_i| \le t. \end{cases} \tag{3.29} \]

This yields

\[ \hat{\mu}_i^s = \begin{cases} X_i - t\, \mathrm{sgn}(X_i), & |X_i| > t \\ 0, & |X_i| \le t. \end{cases} \tag{3.30} \]

An alternative representation of the soft thresholding technique is obtained by use of the minimum
operator to write

\[ g_i^s(\vec{X}) = -\min(|X_i|, t)\, \mathrm{sgn}(X_i). \tag{3.31} \]

Then, for soft thresholding, the mean estimate is defined as

\[ \hat{\mu}_i^s = X_i - \min(|X_i|, t)\, \mathrm{sgn}(X_i). \tag{3.32} \]
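A minimal sketch of the soft thresholding rule of equation 3.32, written in plain Python. The function names `sgn` and `soft_threshold` are illustrative choices, not names used in this thesis.

```python
def sgn(x):
    """Sign function: -1, 0 or +1."""
    return (x > 0) - (x < 0)


def soft_threshold(x, t):
    """Apply mu_hat_i = X_i - min(|X_i|, t) * sgn(X_i) to every element of x."""
    return [xi - min(abs(xi), t) * sgn(xi) for xi in x]


# Elements with |X_i| <= t are set to zero; the others are shrunk toward zero by t.
shrunk = soft_threshold([3.0, -0.5, -2.0, 1.0], 1.0)
```

With the threshold $t = 1$, the sample vector above maps to `[2.0, 0.0, -1.0, 0.0]`.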


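3.3.0.1 Definition of The Soft SURE Function. Since $g^s(\vec{X})$ is almost differentiable,
we may write

\[ \frac{\partial g_i^s(\vec{X})}{\partial X_i} = \begin{cases} 0, & |X_i| > t \\ -1, & |X_i| \le t. \end{cases} \tag{3.33} \]

By using the characteristic function, which is defined by

\[ \chi_{[-t,t]}(X_i) = \begin{cases} 0, & |X_i| > t \\ 1, & |X_i| \le t, \end{cases} \tag{3.34} \]

we get

\[ \frac{\partial g_i^s(\vec{X})}{\partial X_i} = -\chi_{[-t,t]}(X_i). \tag{3.35} \]

We conclude then that

\[ \nabla \cdot g^s(\vec{X}) = -\sum_{i=0}^{N-1} \chi_{[-t,t]}(X_i). \tag{3.36} \]

Since

\[ \|g^s(\vec{X})\|^2 = \sum_{i=0}^{N-1} \left[ g_i^s(\vec{X}) \right]^2 = \sum_{i=0}^{N-1} \left[ \min(|X_i|, t) \right]^2, \tag{3.37} \]

combining equations 3.28, 3.36, and 3.37 together, Donoho and Johnstone (5) (6) (9) obtained the
following:

\[ SURE_{soft}(t, \vec{X}) = N\sigma^2 + \sum_{i=0}^{N-1} \left[ \min(|X_i|, t) \right]^2 - 2\sigma^2 \sum_{i=0}^{N-1} \chi_{[-t,t]}(X_i), \tag{3.38} \]

so that equation 3.28 becomes:

\[ E_\mu\left[ \|\hat{\vec{\mu}} - \vec{\mu}\|^2 \right] = E_\mu\left[ SURE_{soft}(t, \vec{X}) \right]. \tag{3.39} \]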
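The soft SURE quantity can be evaluated directly from the data once the noise variance is known. The sketch below assumes $SURE_{soft}(t, \vec{X}) = N\sigma^2 + \sum_i [\min(|X_i|, t)]^2 - 2\sigma^2 \sum_i \chi_{[-t,t]}(X_i)$; the function name `sure_soft` and the argument name `sigma2` are illustrative.

```python
def sure_soft(t, x, sigma2):
    """Soft SURE value for threshold t, data x and noise variance sigma2."""
    n = len(x)
    # Sum of squared clipped magnitudes, i.e. ||g(X)||^2 for soft thresholding.
    clipped_energy = sum(min(abs(xi), t) ** 2 for xi in x)
    # Number of elements inside [-t, t], i.e. the sum of characteristic functions.
    inside = sum(1 for xi in x if abs(xi) <= t)
    return n * sigma2 + clipped_energy - 2.0 * sigma2 * inside


value = sure_soft(1.0, [3.0, -0.5, -2.0, 1.0], 1.0)
```

For this sample vector with $t = 1$ and $\sigma^2 = 1$ the three terms are 4, 3.25 and 4, giving a value of 3.25.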

3.3.0.2 Soft Threshold. Since we want to minimize the estimate of the error associated
with estimating the mean $\vec{\mu}$, we need to choose a threshold that minimizes the $SURE_{soft}$
quantity defined by equation 3.38. In order to choose the right threshold we need to proceed as
follows. Assume that the coordinates $X_i$ of the vector $\vec{X}$ have been ordered in an ascending manner
by absolute value such that:

\[ |X_0| \le |X_1| \le \ldots \le |X_{N-1}|, \tag{3.40} \]

and let $t > 0$ be an arbitrary threshold such that for some $i = 0, 1, 2, \ldots, N-1$

\[ |X_i| < t < t + \Delta t < |X_{i+1}|. \]


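We have

\[
\begin{aligned}
SURE_{soft}(t + \Delta t, \vec{X}) - SURE_{soft}(t, \vec{X})
&= \sum_{j=0}^{N-1} \left[ \min(|X_j|, t + \Delta t) \right]^2 - 2\sigma^2 \sum_{j=0}^{N-1} \chi_{[-(t+\Delta t),\, t+\Delta t]}(X_j) \\
&\quad - \sum_{j=0}^{N-1} \left[ \min(|X_j|, t) \right]^2 + 2\sigma^2 \sum_{j=0}^{N-1} \chi_{[-t,t]}(X_j) \\
&= \sum_{j=i+1}^{N-1} \left[ \min(|X_j|, t + \Delta t) \right]^2 - \sum_{j=i+1}^{N-1} \left[ \min(|X_j|, t) \right]^2 \\
&= \sum_{j=i+1}^{N-1} \left[ (t + \Delta t)^2 - t^2 \right] \\
&= \sum_{j=i+1}^{N-1} \left[ (2t + \Delta t)\Delta t \right] \\
&> 0,
\end{aligned}
\tag{3.41}
\]

which means that

\[ SURE_{soft}(t + \Delta t, \vec{X}) > SURE_{soft}(t, \vec{X}) > SURE_{soft}(|X_i|, \vec{X}). \]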


We conclude then that in order to choose a threshold that minimizes the $SURE_{soft}$ quantity, we
need only test thresholds that are elements of the known set

\[ \{ |X_i| \}_{i=0}^{N-1}. \]

The domain for our soft threshold is then defined as

\[ \{0\} \cup \{ |X_i| \}_{i=0}^{N-1}. \]

The value 0 is included in order to take care of the cases where $\hat{\vec{\mu}} = \vec{X}$ and $g(\vec{X}) = 0$. The threshold
that minimizes the $SURE_{soft}$ quantity will be denoted by

\[ t^{Soft} = \arg\min_t \left[ SURE_{soft}(t, \vec{X}) \right], \quad \text{where } t \in \{0\} \cup \{ |X_i| \}_{i=0}^{N-1}. \tag{3.42} \]
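Because only the values in $\{0\} \cup \{|X_i|\}$ need to be tested, the threshold search reduces to a finite scan. The following is a minimal sketch under that assumption; `sure_soft` and `soft_sure_threshold` are illustrative names, and the noise variance `sigma2` is assumed known.

```python
def sure_soft(t, x, sigma2):
    """N*sigma2 + sum(min(|X_i|, t)^2) - 2*sigma2 * #{i : |X_i| <= t}."""
    clipped_energy = sum(min(abs(xi), t) ** 2 for xi in x)
    inside = sum(1 for xi in x if abs(xi) <= t)
    return len(x) * sigma2 + clipped_energy - 2.0 * sigma2 * inside


def soft_sure_threshold(x, sigma2):
    """Return the candidate in {0} U {|X_i|} with the smallest soft SURE value."""
    candidates = [0.0] + sorted(abs(xi) for xi in x)
    return min(candidates, key=lambda t: sure_soft(t, x, sigma2))


# Three small (noise-like) values and one large value.
data = [0.1, -0.2, 5.0, 0.3]
best_t = soft_sure_threshold(data, 1.0)
```

For this sample the scan picks `0.3`, the largest of the small magnitudes, so the three small values are zeroed and the large one is kept (shrunk by 0.3). The scan evaluates the SURE quantity once per candidate, so its cost is quadratic in $N$ as written; a cumulative-sum variant would be faster but is not shown here.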


3.4 Hard Thresholding Technique

Just like the Soft Thresholding Technique (STT), the Hard Thresholding Technique (HTT)
uses a threshold ($t > 0$) to estimate the true mean, $\mu_i$, of each independent normal random variable,
$X_i$, by the estimate $\hat{\mu}_i^h$, defined by (see figure 3.2)

\[ \hat{\mu}_i^h = X_i + g_i^h(\vec{X}), \]

where for each $i = 0, 1, 2, \ldots, N-1$

\[ g_i^h(\vec{X}) = \begin{cases} 0, & |X_i| > t \\ -X_i, & |X_i| \le t. \end{cases} \tag{3.43} \]

This yields

\[ \hat{\mu}_i^h = \begin{cases} X_i, & |X_i| > t \\ 0, & |X_i| \le t. \end{cases} \tag{3.44} \]

Figure 3.2 Hard thresholding technique (HTT).


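An alternative representation of the hard thresholding technique (HTT) is obtained by use
of the characteristic function defined by equation 3.34, such that

\[ g_i^h(\vec{X}) = -X_i\, \chi_{[-t,t]}(X_i). \tag{3.45} \]

Then, for hard thresholding, the $g$ function is defined as

\[ g^{Hard}(\vec{X}) = \left( g_0^h(\vec{X}), g_1^h(\vec{X}), \ldots, g_{N-1}^h(\vec{X}) \right). \tag{3.46} \]

3.4.0.3 Definition of The Hard SURE Function. Although the hard thresholding
function, $g^{Hard}$, is not almost differentiable, we decided to use it with Stein's criteria in order to
compare the results with the soft thresholding technique. We may then write

\[ \frac{\partial g_i^h(\vec{X})}{\partial X_i} = \begin{cases} 0, & |X_i| > t \\ -1, & |X_i| \le t, \end{cases} \tag{3.47} \]

by using the characteristic function, we have

\[ \frac{\partial g_i^h(\vec{X})}{\partial X_i} = -\chi_{[-t,t]}(X_i). \tag{3.48} \]

We conclude then that

\[ \nabla \cdot g^{Hard}(\vec{X}) = -\sum_{i=0}^{N-1} \chi_{[-t,t]}(X_i). \tag{3.49} \]

Since

\[ \|g^{Hard}(\vec{X})\|^2 = \sum_{i=0}^{N-1} \left[ g_i^h(\vec{X}) \right]^2 = \sum_{i=0}^{N-1} X_i^2\, \chi_{[-t,t]}(X_i), \tag{3.50} \]

combining equations 3.28, 3.49, and 3.50 together, we can define the following quantity

\[ SURE_{hard}(t, \vec{X}) = N\sigma^2 + \sum_{i=0}^{N-1} X_i^2\, \chi_{[-t,t]}(X_i) - 2\sigma^2 \sum_{i=0}^{N-1} \chi_{[-t,t]}(X_i), \tag{3.51} \]

so that equation 3.28 becomes:

\[ E_\mu\left[ \|\hat{\vec{\mu}} - \vec{\mu}\|^2 \right] = E_\mu\left[ SURE_{hard}(t, \vec{X}) \right]. \tag{3.52} \]

Just like the case of the soft threshold, the domain of the hard threshold is given by:

\[ \{0\} \cup \{ |X_i| \}_{i=0}^{N-1}, \]

and the hard threshold should be chosen such that the $SURE_{hard}$ quantity is minimized

\[ t^{Hard} = \arg\min_t \left[ SURE_{hard}(t, \vec{X}) \right], \quad \text{where } t \in \{0\} \cup \{ |X_i| \}_{i=0}^{N-1}. \tag{3.53} \]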
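The hard thresholding rule and its SURE-based threshold search can be sketched in the same way as the soft case. The sketch assumes $SURE_{hard}(t, \vec{X}) = N\sigma^2 + \sum_i X_i^2 \chi_{[-t,t]}(X_i) - 2\sigma^2 \sum_i \chi_{[-t,t]}(X_i)$; all function names are illustrative, and `sigma2` is assumed known.

```python
def hard_threshold(x, t):
    """Keep X_i when |X_i| > t, otherwise set it to zero."""
    return [xi if abs(xi) > t else 0.0 for xi in x]


def sure_hard(t, x, sigma2):
    """N*sigma2 + (energy of zeroed elements) - 2*sigma2 * (number of zeroed elements)."""
    zeroed = [xi for xi in x if abs(xi) <= t]
    return len(x) * sigma2 + sum(xi * xi for xi in zeroed) - 2.0 * sigma2 * len(zeroed)


def hard_sure_threshold(x, sigma2):
    """Return the candidate in {0} U {|X_i|} with the smallest hard SURE value."""
    candidates = [0.0] + sorted(abs(xi) for xi in x)
    return min(candidates, key=lambda t: sure_hard(t, x, sigma2))


data = [0.1, -0.2, 5.0, 0.3]
best_t = hard_sure_threshold(data, 1.0)
denoised = hard_threshold(data, best_t)
```

For this sample the search again selects `0.3`, and the surviving element keeps its full amplitude, which is the practical difference from soft thresholding.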


3.5 Wavelet Transform

The continuous wavelet transform (CWT) is a technique that decomposes and analyzes a finite
energy signal, $f(t) \in L^2(\mathbb{R})$ (the set of square-integrable Lebesgue-measurable functions), using different resolutions for
different scales (4), where

\[ L^2(\mathbb{R}) = \left\{ f : \int |f(t)|^2\, dt < \infty \right\}. \tag{3.54} \]

The CWT is based on defining a “mother wavelet”, $\psi$, which is subject to the following condition
of admissibility:

\[ \int_{-\infty}^{+\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\, d\omega < \infty, \tag{3.55} \]

where $\hat{\psi}$ is the Fourier transform of $\psi$. This condition implies that $\hat{\psi}$ decays to zero as the frequency
goes to infinity; furthermore, it implies that the mother wavelet, $\psi$, is zero-mean:

\[ \int \psi(t)\, dt = 0. \tag{3.56} \]

Since equation 3.55 requires that the Fourier transform of $\psi(t)$ at the zero frequency (i.e., $\omega = 0$)
is zero

\[ \hat{\psi}(0) = 0, \tag{3.57} \]

it is clear that $\psi$ represents a band-pass filter (see figures B.1 through B.6 for three different
wavelets).

Based on the above conditions, the continuous wavelet transform with scale $a$ and shift $b$ is
defined as

\[ \mathcal{W}^{a,b}[f(t)] = \int_{-\infty}^{+\infty} f(t)\, \psi_{a,b}^{*}(t)\, dt, \tag{3.58} \]

where $(a, b) \in \mathbb{R}^{+} \times \mathbb{R}$ and

\[ \psi_{a,b}(t) = a^{-1/2}\, \psi\left( \frac{t - b}{a} \right), \tag{3.59} \]

and the asterisk indicates complex conjugation. The families of functions $\psi_{a,b}$ define a basis for the
family of finite energy functions $L^2(\mathbb{R})$.

3.5.1 Properties of The Wavelet Transform. The following properties of the wavelet
transform will prove very useful in our future derivations of the discrete wavelet transform (DWT)
and the extension of the thresholding techniques to the wavelet domain.

Linearity: $\forall\, \alpha, \beta \in \mathbb{R}$

\[ \mathcal{W}^{a,b}[\alpha f(t) + \beta g(t)] = \alpha\, \mathcal{W}^{a,b}[f(t)] + \beta\, \mathcal{W}^{a,b}[g(t)] \tag{3.60} \]

Scaling: $\forall\, \lambda \in \mathbb{R} - \{0\}$

$|\hat{\psi}(f)|^2$, where $f$ represents frequency.

\[ f_0 = \int_0^{\infty} f\, |\hat{\psi}(f)|^2\, df. \tag{3.66} \]

2. The second moment or the variance, $\sigma^2$, of this wavelet based pdf is then defined as

\[ \sigma^2 = \int_0^{\infty} (f - f_0)^2\, |\hat{\psi}(f)|^2\, df. \tag{3.67} \]

The  value  of  this  variance  measures  the  dispersion  of  frequencies  relative  to  the  mean  fo.  The  larger 
the  variance,  the  more  dispersed  are  the  frequencies  relative  to  the  mean.  This  also  means  that 
the  passband  of  the  wavelet  is  larger  with  a  wider  bandwidth.  The  center  frequency  of  a  wavelet 
allows  us  to  determine  the  range  of  frequencies  that  are  filtered  at  a  specific  resolution  level  “a”. 

3.5.3 Resolution Properties Of The Families of Wavelets $\psi_{a,b}$. The families of wavelets
$\psi_{a,b}(t)$ are formed by dilations (using the scale $a$) and translations (using the shift $b$) of the mother
wavelet $\psi$. The admissibility condition defined above still holds for these newly formed wavelets.
Since

\[ \psi_{a,b}(t) = a^{-1/2}\, \psi\left( \frac{t - b}{a} \right), \tag{3.68} \]

these wavelets have an expected value at time $t = b$ and it can be shown that their variance is given
by

\[ \sigma^2(a, b) = a^2 \sigma_t^2, \tag{3.69} \]

where $\sigma_t^2$ is the variance of the mother wavelet.

1. The Fourier transform of $\psi_{a,b}(t)$ is given by

\[ \hat{\psi}_{a,b}(f) = a^{1/2}\, \hat{\psi}(af)\, e^{-i 2\pi f b}, \tag{3.70} \]

where $\hat{\psi}$ is the Fourier transform of the mother wavelet, defined by

\[ \hat{\psi}(f) = \int_{-\infty}^{+\infty} \psi(t)\, e^{-i 2\pi f t}\, dt, \tag{3.71} \]

and $i$ is the complex number such that

\[ i^2 = -1. \tag{3.72} \]

2. The center frequency, $f_{a,b}$, of these wavelets is related to the center frequency, $f_0$, of
the mother wavelet by the following relation

\[ f_{a,b} = \frac{f_0}{a}. \tag{3.73} \]

3. The variance of these wavelets, $\sigma_{a,b}^2$, is then related to the variance, $\sigma^2$, of the mother
wavelet by the following equation

\[ \sigma_{a,b}^2 = \frac{\sigma^2}{a^2}. \tag{3.74} \]

A moment's reflection on the above two parameters shows that as the value of the dilation
parameter $a$ increases, the bandpass center frequency, $f_{a,b}$, of the wavelet $\psi_{a,b}(t)$ approaches the
lower frequencies near the origin (the dc frequency), with a smaller variance or bandwidth. This
shows that by changing the value of the dilation parameter “a”, we can “zoom in” to different
frequencies in the spectrum of the signal $f(t)$.
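The inverse dependence of the center frequency and of the frequency variance on the scale can be checked with a change of variable. The short derivation below is a sketch under two assumptions that are made here for illustration: the Fourier transform of the dilated and shifted wavelet is $\hat{\psi}_{a,b}(f) = a^{1/2}\hat{\psi}(af)e^{-i2\pi fb}$, and $|\hat{\psi}(f)|^2$ has unit area on $[0, \infty)$ so that it can be read as a pdf.

```latex
% The phase factor has unit modulus, so |\hat{\psi}_{a,b}(f)|^2 = a\,|\hat{\psi}(af)|^2.
% Substitute u = af, so that f = u/a and df = du/a.
\begin{aligned}
f_{a,b} &= \int_0^{\infty} f\, a\,|\hat{\psi}(af)|^2\, df
         = \frac{1}{a}\int_0^{\infty} u\,|\hat{\psi}(u)|^2\, du
         = \frac{f_0}{a}, \\
\sigma_{a,b}^2 &= \int_0^{\infty} \left(f - \frac{f_0}{a}\right)^2 a\,|\hat{\psi}(af)|^2\, df
         = \frac{1}{a^2}\int_0^{\infty} (u - f_0)^2\,|\hat{\psi}(u)|^2\, du
         = \frac{\sigma^2}{a^2}.
\end{aligned}
```

Doubling the scale therefore halves both the center frequency and the standard deviation of the passband, which is the “zoom” behaviour described above.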

3.6  Discrete  Wavelet  Transform 

Since the admissibility condition defined above holds for $\psi_{a,b}(t)$, the families of functions
$\psi_{a,b}(t)$, which are formed by dilations (scale $a$) and translations (shift $b$) of the mother wavelet
$\psi$, are themselves wavelets. They form a basis for $L^2(\mathbb{R})$. Since equation 3.58 represents an inner
product between the function $f(t)$ and the corresponding wavelet $\psi_{a,b}(t)$, the wavelet transform with
a particular choice of “a” and “b” is, indeed, a measure of the similarity between $f(t)$ and $\psi_{a,b}(t)$.
While these newly formed wavelets are a basis for $L^2(\mathbb{R})$, they are not necessarily orthogonal and
may redundantly represent the signal $f(t)$ (1). By discretizing the values of the shift and scale
parameters, it is possible to find an orthonormal set of wavelets to represent functions in $L^2(\mathbb{R})$. If
we choose $a = a_0^m$ and $b = n b_0 a_0^m$ for some $m, n \in \mathbb{Z}$, it is possible to find an orthonormal wavelet
basis for $L^2(\mathbb{R})$. The choice most commonly made is $a_0 = 2$ and $b_0 = 1$, where $a_0$ is known as
the dilation factor.

3.6.1 Multi-resolution Analysis. The Multi-resolution Analysis (MRA) of a signal $f(t)$
was first introduced by Mallat and Meyer in 1986 (16). The MRA decomposes a signal into a set of
approximations where the orthonormal wavelet bases are used as a tool to describe, mathematically,
the “increment of information” needed to go from one coarse approximation to a finer or higher
resolution approximation (4). Since the analysis of the signal $f(t)$ is based on a set of orthonormal
wavelets which form a basis for $L^2(\mathbb{R})$, the amount of information needed to implement the MRA
is kept to a minimum. Mallat developed a fast algorithm to implement the MRA.

3.6.1.1 MRA Requirements. A multi-resolution analysis consists of a set of approximation
spaces, $V_j \subset L^2(\mathbb{R})$ ($j \in \mathbb{Z}$), which satisfy the following six requirements (21):
Requirement 1

The approximation spaces $V_j$ are embedded such that

\[ \cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset V_{-2} \subset \cdots \tag{3.75} \]

Requirement 2

\[ \overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R}). \tag{3.76} \]

Requirement 3

\[ \bigcap_{j \in \mathbb{Z}} V_j = \{0\}. \tag{3.77} \]

Equation 3.76 ensures that $\forall f \in L^2(\mathbb{R})$

\[ \lim_{j \to -\infty} P_j f = f, \]

where $P_j f$ is the orthogonal projection of $f(t)$ onto $V_j$.

Requirement 4

The above approximation spaces must satisfy

\[ f(t) \in V_j \iff f(2^j t) \in V_0. \tag{3.78} \]

Equations 3.75 and 3.78 imply that all spaces of the MRA are scaled versions of the central space
$V_0$.

Requirement 5

The central space must be invariant under integer translations. $\forall n \in \mathbb{Z}$ we have

\[ f(t) \in V_0 \Rightarrow f(t - n) \in V_0. \tag{3.79} \]

Requirement 6

There must exist a scaling function $\phi$ such that

$\{\phi_{0,n}\}_{n \in \mathbb{Z}}$ is an orthonormal basis in $V_0$,

where $\forall m, n \in \mathbb{Z}$

\[ \phi_{m,n}(t) = 2^{-m/2}\, \phi(2^{-m} t - n). \tag{3.80} \]

The above equation implies that the set $\{\phi_{m,n}\}_{n \in \mathbb{Z}}$ is an orthonormal basis for the approximation
space $V_m$, where $m \in \mathbb{Z}$.
3.6.1.2 Detail spaces. To completely characterize the MRA, the above six criteria
can be used to construct a set of orthonormal wavelet basis $\{\psi_{m,n}\}_{n,m \in \mathbb{Z}}$ of $L^2(\mathbb{R})$, where

\[ \psi_{m,n}(t) = 2^{-m/2}\, \psi(2^{-m} t - n), \tag{3.81} \]

such that

\[ P_{m-1} f = P_m f + \sum_{n \in \mathbb{Z}} \langle f, \psi_{m,n} \rangle\, \psi_{m,n}, \tag{3.82} \]

where $P_m f$ is the orthogonal projection of $f$ onto the approximation space $V_m$ and $\langle f, \psi_{m,n} \rangle$
represents the $L^2(\mathbb{R})$ inner product of $f$ and $\psi_{m,n}$.

Let $W_m$ be the orthogonal complement of $V_m$ in $V_{m-1}$ such that
$W_m \perp V_m$, with $V_m \subset V_{m-1}$ and $W_m \subset V_{m-1}$.

The above definitions imply that the orthogonal projection of the function $f(t)$ onto the approximation
space $V_{m-1}$ is the same as the orthogonal projection of the function $f(t)$ onto the approximation
space $V_m$ plus the “information difference”, $Q_m f$, between the two successive approximations, $P_m f$
and $P_{m-1} f$:

\[ Q_m f = P_{m-1} f - P_m f, \tag{3.83} \]

where $Q_m f \in W_m$ and $Q_m f \perp V_m$.

Equation 3.83 implies that the set $\{\psi_{m,n}\}_{n \in \mathbb{Z}}$ is an orthonormal basis for $W_m$ and that

\[ V_{m-1} = V_m \oplus W_m, \tag{3.84} \]

where $\oplus$ designates the direct sum operator of two linear spaces. Furthermore, the orthogonal
complements, $\{W_m\}_{m \in \mathbb{Z}}$, are mutually orthogonal such that for $i \ne j$

\[ W_i \cap W_j = \{0\}. \]

Since the subspaces $\{W_m\}_{m \in \mathbb{Z}}$ are mutually orthogonal, they effectively divide $L^2(\mathbb{R})$ into mutually
orthogonal subspaces and we have

\[ \bigoplus_{m \in \mathbb{Z}} W_m = L^2(\mathbb{R}). \tag{3.85} \]

In conclusion, the set of wavelets $\{\psi_{m,n}\}_{n,m \in \mathbb{Z}}$ is an orthonormal basis for $L^2(\mathbb{R})$.

3.6.2 Decomposition and Reconstruction of a finite energy signal using DWT. Let
$f(t) \in L^2(\mathbb{R})$, and denote the orthogonal projection of $f(t)$ onto the space $W_m$ by $Q_m f(t)$. Since
$\{\psi_{m,n}\}_{n \in \mathbb{Z}}$ is an orthonormal basis for $W_m$, we can write $Q_m f(t)$ as a linear combination of the
discrete wavelet series $\{\psi_{m,n}\}_{n \in \mathbb{Z}}$ such that

\[ Q_m f(t) = \sum_{n \in \mathbb{Z}} d_{m,n}\, \psi_{m,n}(t), \tag{3.86} \]

where $d_{m,n} = \langle f, \psi_{m,n} \rangle$ are known as the $m^{th}$-level “detail coefficients”. Since $\{\phi_{m,n}\}_{n \in \mathbb{Z}}$ is an
orthonormal basis for $V_m$, the orthogonal projection $P_m f(t)$ of $f(t)$ onto the space $V_m$ is defined
in a similar way as

\[ P_m f(t) = \sum_{n \in \mathbb{Z}} c_{m,n}\, \phi_{m,n}(t), \tag{3.87} \]

where $c_{m,n} = \langle f, \phi_{m,n} \rangle$ are known as the $m^{th}$-level “approximation coefficients”.

Consider the scaling function $\phi_{1,0}(t)$. Since $V_1 \subset V_0$, we can represent $\phi_{1,0}(t)$ as a linear
combination of the zeroth level basis, $\{\phi_{0,n}(t)\}_{n \in \mathbb{Z}}$

\[ 2^{-1/2}\, \phi(t/2) = \sum_{n \in \mathbb{Z}} h_n\, \phi(t - n), \tag{3.88} \]

where

\[ h_n = \langle \phi_{1,0}, \phi_{0,n} \rangle. \tag{3.89} \]

Similarly, since $W_1 \subset V_0$ and $\{\psi_{1,n}(t)\}_{n \in \mathbb{Z}}$ is a basis for $W_1$, we can define

\[ 2^{-1/2}\, \psi(t/2) = \sum_{n \in \mathbb{Z}} g_n\, \phi(t - n), \tag{3.90} \]

where

\[ g_n = \langle \psi_{1,0}, \phi_{0,n} \rangle. \tag{3.91} \]

The discrete filters $h_n$ and $g_n$ play a major role in the multi-resolution analysis. Mallat
showed that the $h$ and $g$ filters can be used to relate the approximations at the $m^{th}$-level to the
approximations and details at the $(m+1)^{st}$-level, respectively. Using these filters, it can be shown
that the equations that relate the approximations and details of different levels are given by

\[ c_{m,n} = \sum_{k \in \mathbb{Z}} c_{m-1,k}\, h_{k-2n}, \tag{3.92} \]

\[ d_{m,n} = \sum_{k \in \mathbb{Z}} c_{m-1,k}\, g_{k-2n}. \tag{3.93} \]

The above equations are the heart of the MRA fast algorithm that was developed by Mallat.

After decomposing the approximation coefficients at the $m^{th}$-level into details and approximations
at the $(m+1)^{st}$-level, we can perform the inverse procedure by using these $(m+1)^{st}$-level
approximations and details to get back our $m^{th}$-level approximations. The filters $h$ and $g$
are again used for this purpose. Written for the levels $m-1$ and $m$, the reconstruction equation is

\[ c_{m-1,n} = \sum_{k \in \mathbb{Z}} c_{m,k}\, h_{n-2k} + \sum_{k \in \mathbb{Z}} d_{m,k}\, g_{n-2k}. \tag{3.94} \]
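As a concrete instance of one decomposition step followed by one reconstruction step, the sketch below uses the two-tap Haar filter $h_0 = h_1 = 1/\sqrt{2}$. The Haar filter is an assumption chosen here for brevity (it is not one of the wavelets analyzed in this thesis), the high-pass taps are taken as $g_n = (-1)^{n-1} h_{1-n}$, and the function names are illustrative.

```python
import math

S = 1.0 / math.sqrt(2.0)
H = {0: S, 1: S}  # Haar low-pass taps h_0, h_1 (assumed example filter)
# High-pass taps from g_n = (-1)^(n-1) * h_(1-n): g_0 = -h_1, g_1 = +h_0.
G = {n: (1 if (n - 1) % 2 == 0 else -1) * H[1 - n] for n in (0, 1)}


def analyze(c_prev):
    """c_n = sum_k c_prev[k] h_(k-2n) and d_n = sum_k c_prev[k] g_(k-2n)."""
    half = len(c_prev) // 2
    c = [sum(H[k - 2 * n] * c_prev[k] for k in (2 * n, 2 * n + 1)) for n in range(half)]
    d = [sum(G[k - 2 * n] * c_prev[k] for k in (2 * n, 2 * n + 1)) for n in range(half)]
    return c, d


def synthesize(c, d):
    """c_prev[n] = sum_k c[k] h_(n-2k) + sum_k d[k] g_(n-2k)."""
    out = [0.0] * (2 * len(c))
    for k in range(len(c)):
        for j in (0, 1):  # n = 2k + j, so n - 2k = j
            out[2 * k + j] = c[k] * H[j] + d[k] * G[j]
    return out


approx, detail = analyze([4.0, 2.0, 5.0, 5.0])
restored = synthesize(approx, detail)
```

The second pair of samples is constant, so its detail coefficient is zero, and `restored` matches the input up to rounding error.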


3.6.3 Characteristics Of The h and g Filters. Daubechies (4) showed that the filters $h$
and $g$ have the following properties

\[ \sum_{n \in \mathbb{Z}} |h_n| < \infty, \tag{3.95} \]

\[ \sum_{n \in \mathbb{Z}} |g_n| < \infty. \tag{3.96} \]

The above two equations require that the filters $h$ and $g$ must be stable.

Let $H(f)$ and $G(f)$ represent the Fourier transforms of the filters $h$ and $g$, respectively. A sufficient
condition for the construction of the wavelet $\psi$ is that the matrix

\[ U = \begin{bmatrix} H(f) & G(f) \\ H(f + \tfrac{1}{2}) & G(f + \tfrac{1}{2}) \end{bmatrix} \tag{3.97} \]

must be unitary (i.e., $U^{*}U = I$, where $I$ is the identity operator).
One possible choice for $G$ is

(3.98)

which leads to the following relation between the coefficients of the $h$ and $g$ filters, $\forall n \in \mathbb{Z}$,

\[ g_n = (-1)^{(n-1)}\, h_{1-n}. \tag{3.99} \]

Finally, the filters $h$ and $g$ must satisfy the following conditions

\[ \sum_{n \in \mathbb{Z}} h_n = \sqrt{2}, \tag{3.100} \]

\[ \sum_{n \in \mathbb{Z}} h_n^2 = 1, \tag{3.101} \]

\[ \sum_{n \in \mathbb{Z}} g_n = 0, \tag{3.102} \]

\[ \sum_{n \in \mathbb{Z}} g_n^2 = 1. \tag{3.103} \]

Equation 3.100 implies that the $h$ filter is a low-pass filter while equation 3.102 implies that the $g$
filter is a high-pass filter.
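The sum conditions on the filters are easy to verify numerically for a given set of taps. The sketch below uses the well-known four-tap Daubechies low-pass filter as an assumed example (it is shorter than the db6, coiflet(6) and db20 filters used in this thesis), builds the high-pass taps from $g_n = (-1)^{n-1} h_{1-n}$, and evaluates $\sum h_n$, $\sum h_n^2$, $\sum g_n$ and $\sum g_n^2$. The names `D4`, `mirror_filter` and `filter_sums` are illustrative.

```python
import math

R3, R2 = math.sqrt(3.0), math.sqrt(2.0)
# Four-tap Daubechies low-pass filter, indexed n = 0..3 (assumed example).
D4 = {0: (1 + R3) / (4 * R2), 1: (3 + R3) / (4 * R2),
      2: (3 - R3) / (4 * R2), 3: (1 - R3) / (4 * R2)}


def mirror_filter(h):
    """High-pass taps g_n = (-1)^(n-1) h_(1-n); with k = 1 - n the sign is (-1)^k."""
    return {1 - k: (1 if k % 2 == 0 else -1) * hk for k, hk in h.items()}


def filter_sums(h):
    """Return (sum h, sum h^2, sum g, sum g^2) for the filter pair built from h."""
    g = mirror_filter(h)
    return (sum(h.values()), sum(v * v for v in h.values()),
            sum(g.values()), sum(v * v for v in g.values()))


sum_h, energy_h, sum_g, energy_g = filter_sums(D4)
```

Up to rounding error the four values are $\sqrt{2}$, 1, 0 and 1, so this filter pair has a low-pass $h$ with unit energy and a high-pass $g$ with zero dc gain.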


3.6.4 Examples Of Wavelets And Filter Coefficients. The following wavelets will be used
in our analysis of noisy speech data (chapter 4). In tables A.1 through A.3, we present the filter
coefficients of three different wavelets, db6, coiflet(6), and db20. These wavelet-based discrete filters
have different filtering properties (see figures A.1, A.2, and A.3). Observe that the $h$ filters are
low-pass filters, while the $g$ filters are high-pass filters. Figures B.1 through B.6 show the wavelets,
scaling functions, and their Fourier transforms. Observe that the amplitudes of the Fourier transforms of
all wavelets represent band-pass filters, while the corresponding scaling functions represent low-pass
filters. Notice that the wavelets corresponding to db6 and coiflet(6) have many high energy side-lobes,
while those of the db20 wavelet have very small side-lobes.

3.7 Implementation Of The Discrete Wavelet Transform (DWT)

In order to efficiently implement the MRA developed by S. Mallat, we proceed as follows (21).

Given a $T$-periodic signal $f(t)$ such that $\forall t \in \mathbb{R}$

\[ f(t + T) = f(t), \tag{3.104} \]


the wavelet transform satisfies

\[ \mathcal{W}^{a,b+T}[f(t)] = \mathcal{W}^{a,b}[f(t)], \tag{3.105} \]

which means that the continuous wavelet transform of a $T$-periodic signal is also $T$-periodic. We
can use this property to minimize the number of calculations needed to decompose a given signal
into sets of details and sets of approximations. The next two sections use this property to develop
an efficient algorithm for decomposing and reconstructing a signal using wavelets.

3.7.1 Decomposition Using DWT. Now, given the filter sequence $h_n$ and $N$ samples of
the function $f(t)$, at a sampling period $\Delta t$, we compute the approximation coefficients, $\{c_{m,n}\}_{n \in \mathbb{Z}}$,
where $1 \le m \le M$, for a total of $M$ levels of decomposition as

\[ c_{m,n} = \sum_{k \in \mathbb{Z}} c_{m-1,k}\, h_{k-2n}, \tag{3.106} \]

where the zeroth-level approximation coefficients are taken to be the samples of $f(t)$ at integer
multiples of $\Delta t$

\[ c_{0,n} = f(n \Delta t). \]

Using equation 3.99, we can calculate the $g_k$ filter sequence. The detail coefficients are then
calculated using the following equation

\[ d_{m,n} = \sum_{k \in \mathbb{Z}} c_{m-1,k}\, g_{k-2n}. \tag{3.107} \]

We can then write

\[ c_{m,n} = \sum_{k \in \mathbb{Z}} c_{m-1,k}\, \tilde{h}_{2n-k}, \tag{3.108} \]

\[ d_{m,n} = \sum_{k \in \mathbb{Z}} c_{m-1,k}\, \tilde{g}_{2n-k}, \tag{3.109} \]

where $\forall n \in \mathbb{Z}$, the new filters $\tilde{h}$ and $\tilde{g}$ are defined as

\[ \tilde{h}_n = h_{-n} \quad \text{and} \quad \tilde{g}_n = g_{-n}. \]

The above two decomposition equations may be viewed as a two-step operation: a convolution
of the sequence $\{c_{m-1,n}\}_{n \in \mathbb{Z}}$ with the filters $\tilde{h}$ and $\tilde{g}$, followed by the operation of “down-sampling”
by a factor of 2; i.e., the convolutions are evaluated at $2n$, keeping only the evenly-indexed
coefficients of the convolution's result.

If the filter $h$ has at most $L$ non-zero elements and the sampled signal $f_n = c_{0,n}$ has at most
$N$ non-zero elements, for $n = 0, 1, \ldots, N-1$, it can be shown that the above convolutions of $\tilde{h}$ and
$\tilde{g}$ with $c_{0,n}$ will have, in general, $N + L - 1$ non-zero elements. The above convolution operations
“spread” the sequences $c_{m,n}$ and $d_{m,n}$. In fact the spreading increases as we move from the $m^{th}$
to the $(m+1)^{st}$-level, for $m = 1, 2, \ldots, M$. In order to avoid this “spreading” at each stage of the
decomposition, the DWT can be implemented using a periodic extension of $f$ so that the sequence
$c_{0,n}$ is $N$-periodic

\[ c_{0,n+N} = c_{0,n}. \]

Assuming that $N = 2^M$, where $M$ is a positive integer, and due to the down-sampling operations
mentioned above, the sequences $c_{m,n}$ and $d_{m,n}$ are also periodic with period $2^{-m}N$. We can then
write the following relations for $m = 1, 2, \ldots, M$

\[ c_{m,n} = c_{m,[n + 2^{-m}N]}, \]

\[ d_{m,n} = d_{m,[n + 2^{-m}N]}. \]

Starting with $N = 2^M$ samples of the original $N$-periodic signal, the down-sampled discrete
wavelet transform (DWT) allows a maximum of $M$ levels of decomposition where at each level $m$, we
have exactly $2^{-m}N$ unique approximation coefficients ($c_{m,n}$) and $2^{-m}N$ unique detail coefficients
($d_{m,n}$). The last level of decomposition, the $M^{th}$ level or the coarsest level, has one approximation
element and one detail element (i.e., $2^{-M}N = 1$). After $M$ levels of decomposition, we end up with
a total of $N - 1$ unique approximation coefficients and $N - 1$ unique detail coefficients (see figure
3.3).

Figure 3.3 Wavelet decomposition of a signal starting with $N = 2^M$ samples and decomposing up
to the $m^{th}$-level where $1 \le m \le M$.

To completely define the above convolutions, at each level $m$, we need only compute the
$2^{-m}N$ unique elements. In order to efficiently implement the above convolutions, we can rewrite
the approximation coefficients at the $m^{th}$ decomposition level as

\[ c_{m,n} = \sum_{k=k_s}^{k_e} c_{m-1,[(k + 2n) \bmod (2^{1-m}N)]}\, h_k, \]

where mod represents the modulo operator and $k_s$ and $k_e$ represent the first and last non-zero
components of the filter $h$, respectively. They are related to the length, $L$, of the filter $h$ as follows

\[ k_e - k_s = L - 1. \]

In a similar fashion, the detail coefficients are implemented as

\[ d_{m,n} = \sum_{k=k_s}^{k_e} c_{m-1,[(k + 2n) \bmod (2^{1-m}N)]}\, g_k. \]

Since $g_n = (-1)^{(n-1)} h_{1-n}$, the $g$ filter length is also $L$ and we have

\[ k_e - k_s = L - 1. \]

The first and last non-zero elements of the filters $h$ and $g$ can be chosen so that the filters' energies
are well-centered, though not all wavelets have filters which can be centered exactly.
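The periodic decomposition can be sketched with modular indexing, so that every level keeps exactly half as many coefficients as the level before it. The code below is an illustration under stated assumptions: the four-tap Daubechies filter stands in for the longer filters used in this thesis, the high-pass taps come from $g_n = (-1)^{n-1} h_{1-n}$, the input length is a power of two, and the names `dwt_level` and `dwt` are hypothetical.

```python
import math

R3, R2 = math.sqrt(3.0), math.sqrt(2.0)
# Four-tap Daubechies low-pass filter, indexed n = 0..3 (assumed example).
D4 = {0: (1 + R3) / (4 * R2), 1: (3 + R3) / (4 * R2),
      2: (3 - R3) / (4 * R2), 3: (1 - R3) / (4 * R2)}


def mirror_filter(h):
    """High-pass taps g_n = (-1)^(n-1) h_(1-n)."""
    return {1 - k: (1 if k % 2 == 0 else -1) * hk for k, hk in h.items()}


def dwt_level(c_prev, h, g):
    """One periodic step: c_n = sum_k h_k c_prev[(k + 2n) mod len(c_prev)], same for d with g."""
    n_prev = len(c_prev)
    half = n_prev // 2
    c = [sum(hk * c_prev[(k + 2 * n) % n_prev] for k, hk in h.items())
         for n in range(half)]
    d = [sum(gk * c_prev[(k + 2 * n) % n_prev] for k, gk in g.items())
         for n in range(half)]
    return c, d


def dwt(x, h, levels):
    """Return the coarsest approximations and the list of detail vectors, finest first."""
    g = mirror_filter(h)
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = dwt_level(approx, h, g)
        details.append(d)
    return approx, details


approx, details = dwt([1.0] * 8, D4, 3)
```

For a constant input of length 8 every detail coefficient vanishes, because the high-pass taps sum to zero, and the single remaining approximation coefficient equals $2\sqrt{2}$. The detail vectors have lengths 4, 2 and 1, for a total of $N - 1 = 7$ detail coefficients.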

3.7.2 Reconstruction Using DWT. We have seen that the approximation coefficients
at the $(m-1)^{st}$-level are related to both the approximations and details of the
$m^{th}$-level by

\[ c_{m-1,n} = \sum_{k \in \mathbb{Z}} c_{m,k}\, h_{n-2k} + \sum_{k \in \mathbb{Z}} d_{m,k}\, g_{n-2k}, \tag{3.110} \]

where for $M$ levels of decomposition, $m$ takes the values $m = 1, 2, \ldots, M$.

The above equation can be rewritten as

\[ c_{m-1,n} = \sum_{k \in \mathbb{Z}} \check{c}_{m,k}\, h_{n-k} + \sum_{k \in \mathbb{Z}} \check{d}_{m,k}\, g_{n-k}, \]

wherein $\check{c}_{m,k}$ and $\check{d}_{m,k}$ represent the “up-sampled” approximation and detail coefficients at the
$m^{th}$ decomposition level, respectively. $\forall k \in \mathbb{Z}$

\[ \check{c}_{m,2k} = c_{m,k} \quad \text{and} \quad \check{c}_{m,2k+1} = 0, \]

\[ \check{d}_{m,2k} = d_{m,k} \quad \text{and} \quad \check{d}_{m,2k+1} = 0. \]

In order to efficiently implement the above reconstruction equation, using the periodic extension
from the last section, we proceed as follows.

Since we have one unique approximation element and one unique detail element at the $M^{th}$-level (i.e., the
$M^{th}$-level is 1-periodic), we can use the above equation to compute the approximations at the level
above (i.e., the $(M-1)^{st}$-level). The number of unique approximation coefficients is $2^{-(M-1)}N = 2$, where
$N = 2^M$ is the number of samples we started with. We can then compute the approximations at the
$(M-2)^{nd}$-level using this newly reconstructed approximation set and the details obtained
during the decomposition process at the $(M-1)^{st}$-level. All in all, for perfect reconstruction of
the sequence $\{c_{0,n}\}_{0 \le n < N = 2^M}$ at the zeroth level, we need to keep the following data

1. All the details obtained during the decomposition process (a total of $N - 1$ unique detail
coefficients).

2. The unique approximation coefficient obtained during the decomposition process at the
$M^{th}$ decomposition level.

In conclusion, starting with an $N = 2^M$-periodic signal, the full DWT (i.e., $M$ levels of
decomposition) produces a total of $N - 1$ unique detail coefficients, and 1 unique approximation
coefficient at the $M^{th}$ decomposition level, for a total of $N$ coefficients. The partial DWT (i.e.,
$m$ levels of decomposition where $1 \le m < M$) produces a total of $N - 2^{(M-m)}$ unique detail
coefficients, and $2^{(M-m)}$ unique approximation coefficients at the $m^{th}$ decomposition level, for a
total of $N = 2^M$ coefficients (see figure 3.4).


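Figure 3.4 Wavelet reconstruction starting from the $m^{th}$-level where $1 \le m \le M$ to the zeroth
level where the number of samples is $N = 2^M$

The reconstruction equation at the $m^{th}$-level can be rewritten as

\[ c_{m-1,n} = \sum_{k=k_s}^{k_e} [(n - k + 1) \bmod 2]\; c_{m,\left[\frac{n-k}{2} \bmod (2^{-m}N)\right]}\, h_k + \sum_{k=k_s}^{k_e} [(n - k + 1) \bmod 2]\; d_{m,\left[\frac{n-k}{2} \bmod (2^{-m}N)\right]}\, g_k, \]

where $0 \le n < 2^{1-m}N$. The factor $[(n - k + 1) \bmod 2]$ is zero whenever $n - k$ is odd, so only terms with an integer index $(n-k)/2$ contribute.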
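The parity form of the reconstruction sum translates directly into code: for each output index $n$, only filter taps $k$ with $n - k$ even contribute, and the coefficient index is $(n-k)/2$ taken modulo the number of coefficients at the coarser level. The sketch below pairs this with a one-level periodic decomposition to check perfect reconstruction. The four-tap Daubechies filter, the relation $g_n = (-1)^{n-1} h_{1-n}$ and all function names are assumptions made for illustration.

```python
import math

R3, R2 = math.sqrt(3.0), math.sqrt(2.0)
# Four-tap Daubechies low-pass filter, indexed n = 0..3 (assumed example).
D4 = {0: (1 + R3) / (4 * R2), 1: (3 + R3) / (4 * R2),
      2: (3 - R3) / (4 * R2), 3: (1 - R3) / (4 * R2)}
# High-pass taps g_n = (-1)^(n-1) h_(1-n).
G4 = {1 - k: (1 if k % 2 == 0 else -1) * hk for k, hk in D4.items()}


def dwt_level(c_prev, h, g):
    """One periodic decomposition step with indices wrapped modulo len(c_prev)."""
    n_prev = len(c_prev)
    half = n_prev // 2
    c = [sum(hk * c_prev[(k + 2 * n) % n_prev] for k, hk in h.items()) for n in range(half)]
    d = [sum(gk * c_prev[(k + 2 * n) % n_prev] for k, gk in g.items()) for n in range(half)]
    return c, d


def idwt_level(c, d, h, g):
    """One periodic reconstruction step using the parity form of the up-sampled sum."""
    half = len(c)
    out = [0.0] * (2 * half)
    for n in range(2 * half):
        for k, hk in h.items():
            if (n - k) % 2 == 0:  # same as [(n - k + 1) mod 2] == 1
                out[n] += hk * c[((n - k) // 2) % half]
        for k, gk in g.items():
            if (n - k) % 2 == 0:
                out[n] += gk * d[((n - k) // 2) % half]
    return out


signal = [1.0, 4.0, -2.0, 0.5, 3.0, 3.0, -1.0, 2.0]
c1, d1 = dwt_level(signal, D4, G4)
restored = idwt_level(c1, d1, D4, G4)
```

Here `restored` agrees with `signal` up to rounding error, which is the perfect reconstruction property expected of an orthonormal filter pair with periodic extension.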

3.7.3 Statistical Properties Of The Wavelet Coefficients Of Random Variables. Let $\vec{X} =
(X_0, X_1, X_2, \ldots, X_{N-1})$ be a normal random vector of $N = 2^M$ independent random variables such
that for $i = 0, 1, 2, \ldots, N-1$

\[ X_i \sim N(\mu_i, \sigma^2), \tag{3.111} \]

where $\vec{\mu} = (\mu_0, \mu_1, \ldots, \mu_{N-1})$ is the vector mean of the normal random vector $\vec{X}$. We will
show that, at each level of decomposition, the details and approximations are also normal random
vectors, where the discrete wavelet decomposition at the $m^{th}$-level ($1 \le m \le M$) is given as in
equations 3.106 and 3.107 by

\[ C_{m,n} = \sum_{k \in \mathbb{Z}} C_{m-1,k}\, h_{k-2n}, \tag{3.112} \]

\[ D_{m,n} = \sum_{k \in \mathbb{Z}} C_{m-1,k}\, g_{k-2n}, \tag{3.113} \]

where $C$ and $D$ denote the approximation and detail random variables, respectively. This property
of the DWT coefficients allows us to use the SURE criterion, which requires the input data to
be normally distributed (see figures C.1 through C.6 for using the STT technique with a noisy
sinewave).

During the decomposition process, the zeroth level approximations are taken to be the vector
$\vec{X}$ itself such that

\[ C_{0,n} = X_n, \]
where $n = 0, 1, 2, \ldots, N-1$. Since, according to equation 3.111, the vector $\mathbf{X}$ is normal and all random variables, $X_i$, are independent, the zeroth-level approximations are also independent and normally distributed with the same parameters as the vector $\mathbf{X}$. By using equation 3.112, the first level approximations can be written as a linear combination of the zeroth level approximations such that

$$C_{1,n} = \sum_{k \in \mathbb{Z}} C_{0,k}\, h_{k-2n}. \tag{3.114}$$

Since $C_{0,k} \sim N(\mu_{0,k}, \sigma^2)$ where $\mu_{0,k} = \mu_k$, we conclude that $C_{1,n}$ is also independent and normally distributed. The mean of $C_{1,n}$, denoted by $\mu_{1,n}$, is given by

$$\begin{aligned}
\mu_{1,n} &= E[C_{1,n}] \\
&= E\Big[\sum_{k \in \mathbb{Z}} C_{0,k}\, h_{k-2n}\Big] \\
&= \sum_{k \in \mathbb{Z}} h_{k-2n}\, E[C_{0,k}] \\
&= \sum_{k \in \mathbb{Z}} h_{k-2n}\, \mu_{0,k} \\
&= \sum_{k \in \mathbb{Z}} h_{k-2n}\, \mu_k.
\end{aligned} \tag{3.115}$$


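Using equation 3.101 and the independence of the zeroth-level approximations, $C_{0,n}$, the variance is given by (12)

$$\begin{aligned}
\mathrm{Var}[C_{1,n}] &= \sum_{k \in \mathbb{Z}} h_{k-2n}^2\, \mathrm{Var}[C_{0,k}] \\
&= \sigma^2 \sum_{k \in \mathbb{Z}} h_{k-2n}^2 \\
&= \sigma^2.
\end{aligned} \tag{3.116}$$

The random variable $C_{1,n}$ is then distributed as

$$C_{1,n} \sim N\Big(\sum_{k \in \mathbb{Z}} h_{k-2n}\, \mu_k,\ \sigma^2\Big). \tag{3.117}$$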

Recursively, the approximation coefficients at the $m^{th}$-level are also independent and normally distributed such that the mean is related to the mean of the $(m-1)^{th}$-level by

$$\mu_{m,n} = \sum_{k \in \mathbb{Z}} h_{k-2n}\, \mu_{m-1,k}. \tag{3.118}$$

Using the above results, we can write

$$C_{m,n} \sim N(\mu_{m,n}, \sigma^2), \tag{3.119}$$

where $\mu_{m,n}$ is defined by equation 3.118.

Since the detail coefficients are also a linear combination of the approximation coefficients (see equation 3.113), it is easy to show that the details at the $m^{th}$-level are also independent and normal random variables such that

$$D_{m,n} \sim N(\mu^{D}_{m,n}, \sigma^2), \tag{3.120}$$

where the details' means at the $m^{th}$-level are related to the approximations' means at the $(m-1)^{th}$-level by

$$\mu^{D}_{m,n} = \sum_{k \in \mathbb{Z}} g_{k-2n}\, \mu_{m-1,k}. \tag{3.121}$$
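The variance-preservation result can be checked numerically. The sketch below is an illustration only: it assumes the two-tap Haar filters $h = (1/\sqrt{2}, 1/\sqrt{2})$ and $g = (1/\sqrt{2}, -1/\sqrt{2})$, which are not necessarily the filters used elsewhere in this work, and the function name `haar_step` is hypothetical. It propagates a noisy vector and its mean vector through three decomposition levels and measures the spread of the coefficients around the propagated means.

```python
import numpy as np

def haar_step(c):
    """One decomposition level with the (assumed) Haar filters.

    Returns the approximation and detail coefficients of the input vector.
    """
    c = np.asarray(c, dtype=float)
    even, odd = c[0::2], c[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-pass h = (1/sqrt(2), 1/sqrt(2))
    detail = (even - odd) / np.sqrt(2.0)  # high-pass g = (1/sqrt(2), -1/sqrt(2))
    return approx, detail

rng = np.random.default_rng(0)
sigma = 2.0
N = 2 ** 16
mu = np.sin(2 * np.pi * np.arange(N) / 64)   # arbitrary deterministic mean vector
x = mu + sigma * rng.standard_normal(N)      # X_i ~ N(mu_i, sigma^2), independent

c, c_mu = x, mu
variances = []
for level in range(1, 4):
    c, d = haar_step(c)            # random coefficients at this level
    c_mu, d_mu = haar_step(c_mu)   # their means, as in equations 3.118 and 3.121
    variances.append((level, np.var(c - c_mu), np.var(d - d_mu)))

for level, var_c, var_d in variances:
    print(level, round(var_c, 3), round(var_d, 3))   # both should stay near sigma^2
```

Because the Haar filters have unit norm, both measured variances should remain close to $\sigma^2 = 4$ at every level, up to sampling error.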


3.8 Complex Statistics and Analysis

The purpose of this section is to relate the statistics of a complex random variable to the statistics of its real and imaginary parts. The relations developed in this section will be used in the analysis of the Fourier transform of a normal random vector.

A complex number $z$ can be defined in its rectangular form as

$$z = x + iy, \tag{3.122}$$

where $x$ and $y$ are real numbers which represent the real and imaginary parts of $z$, respectively. The complex number $i$ is as defined in equation 3.72. The next sections develop the geometric properties of complex numbers and the statistical properties of complex random variables.

3.8.0.1 Geometric Properties of Complex Numbers. The amplitude of a complex number is defined as

$$|z| = \sqrt{x^2 + y^2}. \tag{3.123}$$

When the product $xy \ne 0$, the phase of a complex number is defined as

$$\arg(z) = \theta, \tag{3.124}$$

where $0 \le \theta < 2\pi$ and

$$\theta = \arctan\Big(\frac{y}{x}\Big). \tag{3.125}$$

a. If $z = 0$ (i.e., $x = 0$ and $y = 0$), then the phase is not defined.

b. If $z = x$ is pure real and non-zero (i.e., $x \ne 0$ and $y = 0$), then

$$\theta = \begin{cases} 0 & x > 0 \\ \pi & x < 0. \end{cases} \tag{3.126}$$

c. If $z = iy$ is imaginary and non-zero (i.e., $x = 0$ and $y \ne 0$), then

$$\theta = \begin{cases} \frac{\pi}{2} & y > 0 \\ \frac{3\pi}{2} & y < 0. \end{cases} \tag{3.127}$$

Using the above properties of complex numbers, we can rewrite equation 3.122 in its polar form as

$$z = |z| e^{i\theta}, \tag{3.128}$$

where $\theta$ and $e^{i\theta}$ are defined by equations 3.125 and 3.140, respectively.

3.8.0.2 Statistical Properties Of Complex Random Variables. A complex random variable is defined as

$$Z = X + iY, \tag{3.129}$$

where both $X$ and $Y$ are real random variables.

1. The expected value of a complex random variable is defined as

$$\begin{aligned}
E[Z] &= E[X + iY] \\
&= E[X] + iE[Y].
\end{aligned} \tag{3.130}$$

2. The variance of a complex random variable is defined as:

$$\begin{aligned}
\mathrm{Var}[Z] &= \mathrm{Var}[X + iY] \\
&= E[|Z|^2] - |E[Z]|^2 \\
&= E[X^2 + Y^2] - \big[E[X]^2 + E[Y]^2\big] \\
&= \big[E[X^2] - E[X]^2\big] + \big[E[Y^2] - E[Y]^2\big] \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y].
\end{aligned} \tag{3.131}$$
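As a quick numerical illustration of equation 3.131, the sketch below draws samples of $X$ and $Y$ (the means and standard deviations are arbitrary values chosen for the example) and compares $E[|Z|^2] - |E[Z]|^2$ with $\mathrm{Var}[X] + \mathrm{Var}[Y]$ using sample moments.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(1.0, 2.0, n)    # real part, Var[X] = 4
y = rng.normal(-3.0, 0.5, n)   # imaginary part, Var[Y] = 0.25
z = x + 1j * y

# Left-hand side of equation 3.131, computed from sample moments.
var_z = np.mean(np.abs(z) ** 2) - np.abs(np.mean(z)) ** 2
# Right-hand side: sum of the variances of the real and imaginary parts.
var_sum = np.var(x) + np.var(y)

print(var_z, var_sum)
```

The two printed values agree to rounding error because the identity holds exactly for sample moments as well; both are close to $4 + 0.25$.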

3.8.0.3 Statistics Of The Amplitude And Phase Of A Complex Random Variable. Let $Z$ be a complex random variable such that

$$Z = X + iY, \tag{3.132}$$

where both $X$ and $Y$ are real independent normal random variables

$$X \sim N(\mu_x, \sigma^2)$$

$$Y \sim N(\mu_y, \sigma^2).$$

The amplitude $|Z| = \sqrt{X^2 + Y^2}$, which is a function of the random variables $X$ and $Y$, has a probability density function defined by

$$f(z) = \begin{cases} \dfrac{z}{\sigma^2}\, I_0\Big[\dfrac{z \mu_z}{\sigma^2}\Big]\, e^{-\frac{z^2 + \mu_z^2}{2\sigma^2}} & z \ge 0 \\ 0 & z < 0, \end{cases} \tag{3.133}$$

where $\mu_z^2 = \mu_x^2 + \mu_y^2$ and $I_0(x)$ is the modified Bessel function defined as

$$I_0(x) = \frac{1}{2\pi} \int_0^{2\pi} e^{x \cos\theta}\, d\theta. \tag{3.134}$$

If $\mu_x = \mu_y = 0$, $f(z)$ is called a Rayleigh distribution (18). The phase $\theta$ of the complex random variable $Z$, which is defined as

$$\theta = \arctan\Big[\frac{Y}{X}\Big], \tag{3.135}$$

where $-\pi < \theta < \pi$, then has a uniform distribution (18) in the interval $(-\pi, \pi)$ defined by

$$f_\theta(\theta) = \begin{cases} \dfrac{1}{2\pi} & -\pi < \theta < \pi \\ 0 & \text{otherwise.} \end{cases} \tag{3.136}$$
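The zero-mean case can be illustrated by simulation. The sketch below assumes $\mu_x = \mu_y = 0$ and an arbitrary $\sigma = 1.5$; it compares the sample mean of the amplitude with the known Rayleigh mean $\sigma\sqrt{\pi/2}$ and checks that the phase histogram is approximately flat over $(-\pi, \pi)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5
n = 400_000

# Z = X + iY with X, Y independent N(0, sigma^2).
z = sigma * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
amp = np.abs(z)        # Rayleigh distributed in the zero-mean case
phase = np.angle(z)    # four-quadrant phase in (-pi, pi]

rayleigh_mean = sigma * np.sqrt(np.pi / 2)   # theoretical mean of the amplitude
counts, _ = np.histogram(phase, bins=8, range=(-np.pi, np.pi))
fractions = counts / n                        # each bin should hold about 1/8

print(amp.mean(), rayleigh_mean)
print(fractions)
```

`np.angle` uses the four-quadrant arctangent, so it resolves the sign ambiguity that a plain $\arctan(Y/X)$ leaves open.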


3.9 Fourier Analysis

The purpose of this section is to define the discrete Fourier transform (DFT), apply the results of the last section to the real and imaginary parts of the DFT of a random vector, and define the statistics of the real and imaginary parts. The results of this section will be used with the results of Stein in order to de-noise the real and imaginary parts of the DFT of a normal random vector.

Given a signal $f(t)$, one is interested in analyzing its frequency content locally in time (4). The standard Fourier transform, which is defined as

$$(\mathcal{F}f)(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt, \tag{3.137}$$

gives a representation of the frequency content of $f(t)$, but it is unable to localize frequencies in time. In order to localize the time occurrence of high frequency bursts, we may first window the signal $f(t)$ and then take the Fourier transform of this windowed portion of the signal $f(t)$

$$(T^{win}f)(\omega, t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(s)\, g(s - t)\, e^{-i\omega s}\, ds, \tag{3.138}$$

where $g(t)$ is a window function.

The above equation is well known in the signal processing field by its discrete form, where the shift $t$ and the frequency $\omega$ are discretized as $t = nt_0$ and $\omega = m\omega_0$. The windowed Fourier transform, or short-time Fourier transform (STFT), can be interpreted as the "amount of the frequency $\omega$" present in the signal $f$ near time $t$.

One similarity between the windowed Fourier transform and the wavelet transform is that both equation 3.58 and equation 3.138 take the inner product of $f$ with a family of functions indexed by two variables, $\psi^{a,b}(t) = |a|^{-1/2}\, \psi\big(\frac{t-b}{a}\big)$ and $g^{\omega,t}(s) = e^{i\omega s} g(s - t)$. However, the difference between the wavelet and windowed Fourier transforms lies in the shapes of the analyzing functions $g^{\omega,t}$ and $\psi^{a,b}$.

The functions $g^{\omega,t}$ all consist of the same envelope function $g$, translated to the proper time location, and "filled in" with higher frequency oscillations. The windowed Fourier transform effectively divides the frequency spectrum of the function $f(t)$ into equal-bandwidth regions. In contrast, the windows used by the wavelet transform are well adapted to their frequency. The use of a dilation factor "a" coupled with a shift variable "b" allows the wavelet transform to decompose and analyze signals using a small bandwidth (broader window) for low frequencies and a large bandwidth (narrow window) for higher frequencies.

The main characteristic of the wavelet transform lies in its ability to "zoom in" and detect very short-lived high frequency phenomena, such as transients in signals or discontinuities in functions (e.g., human vocal tract glottal closure).

3.9.1 Discrete Fourier Transform (DFT). The discrete Fourier transform (DFT) of a periodic finite-length sequence of $N$ points, $\{x_m\}_{m=0}^{N-1}$, is defined as

$$\hat{x}_k = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x_m\, e^{i\left(\frac{2\pi}{N}\right)km}, \tag{3.139}$$

where $0 \le k \le N - 1$.

The quantity $e^{i\theta}$ is defined as

$$e^{i\theta} = \cos(\theta) + i \sin(\theta), \tag{3.140}$$

where $i$ is the complex number defined by equation 3.72. For each $0 \le k \le N - 1$, the quantity $|\hat{x}_k|$ measures the amount of frequency $\omega = \frac{2\pi k}{N}$ present in the signal $\{x_m\}_{m=0}^{N-1}$. In order to get back our original signal, $\{x_m\}_{m=0}^{N-1}$, from the DFT sequence, $\{\hat{x}_k\}_{k=0}^{N-1}$, we perform the inverse discrete Fourier transform (IDFT) defined as

$$x_m = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} \hat{x}_k\, e^{-i\left(\frac{2\pi}{N}\right)km}. \tag{3.141}$$

We conclude then that the sequence $\{x_m\}_{m=0}^{N-1}$ can be represented as a sum of sinusoids of frequencies $0, 1, 2, \ldots, N - 1$. Hence the discrete Fourier transform can also be interpreted as a frequency analysis (or "spectrum analysis") of the input signal $\{x_m\}_{m=0}^{N-1}$ (19).
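A direct implementation helps make the definitions concrete. The sketch below assumes the symmetric $1/\sqrt{N}$ normalization and the positive exponent in the forward transform, which is the convention implied by the sine and cosine decomposition and the variance results in this chapter; the function names `dft` and `idft` are illustrative. Under that assumption the forward transform coincides with NumPy's `np.fft.ifft(x, norm="ortho")`.

```python
import numpy as np

def dft(x):
    """Forward DFT with 1/sqrt(N) scaling and a positive exponent (assumed convention)."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    k = np.arange(N).reshape(-1, 1)   # output frequency index
    m = np.arange(N).reshape(1, -1)   # input sample index
    W = np.exp(2j * np.pi * k * m / N) / np.sqrt(N)
    return W @ x

def idft(X):
    """Inverse DFT: same scaling, negative exponent."""
    X = np.asarray(X, dtype=complex)
    N = X.size
    m = np.arange(N).reshape(-1, 1)   # output sample index
    k = np.arange(N).reshape(1, -1)   # input frequency index
    W = np.exp(-2j * np.pi * k * m / N) / np.sqrt(N)
    return W @ X

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5])
X = dft(x)
x_back = idft(X)

print(np.allclose(x_back, x))                          # round trip recovers the signal
print(np.allclose(X, np.fft.ifft(x, norm="ortho")))    # matches NumPy under this convention
```

With this scaling the transform is unitary, so the energy of the signal equals the energy of its DFT coefficients.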

3.9.2 Properties Of The DFT. In this section, we show some of the properties of the real and imaginary parts of the DFT of a real signal. We will use these properties on several occasions in order to decrease the number of calculations needed to implement the DFT. We will also show that some of the DFT components (e.g., the dc component) have unique properties.

Using equations 3.139 and 3.140, we can decompose the DFT into a sine and cosine series as follows

$$\hat{x}_k = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x_m \Big\{ \cos\Big(\frac{2\pi}{N}km\Big) + i \sin\Big(\frac{2\pi}{N}km\Big) \Big\}. \tag{3.142}$$

From the above equation, the real part is defined as

$$\mathrm{Re}[\hat{x}_k] = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x_m \cos\Big[\Big(\frac{2\pi}{N}\Big)km\Big], \tag{3.143}$$

and the imaginary part is defined as

$$\mathrm{Im}[\hat{x}_k] = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x_m \sin\Big[\Big(\frac{2\pi}{N}\Big)km\Big]. \tag{3.144}$$

The elements of the DFT sequence can then be rewritten as

$$\hat{x}_k = \mathrm{Re}[\hat{x}_k] + i\, \mathrm{Im}[\hat{x}_k]. \tag{3.145}$$

Assume that $N$ is even.

a. The dc component ($k = 0$) is real:

$$\hat{x}_0 = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x_m = \mathrm{Re}[\hat{x}_0], \tag{3.146}$$

which means that the imaginary part of $\hat{x}_0$ is zero:

$$\mathrm{Im}[\hat{x}_0] = 0.$$

b. $\hat{x}_{N/2}$ is real:

$$\hat{x}_{N/2} = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} (-1)^m x_m,$$

which means that the imaginary part of $\hat{x}_{N/2}$ is also zero:

$$\mathrm{Im}[\hat{x}_{N/2}] = 0. \tag{3.147}$$

c. Symmetry: for $1 \le k \le N - 1$

$$\hat{x}_k = \overline{\hat{x}_{N-k}}. \tag{3.148}$$

The above equation has some practical consequences:

1. We need only calculate the partial DFT sequence $\{\hat{x}_k\}_{k=0}^{N/2}$.

2. $\mathrm{Re}[\hat{x}_k]$ is even since the cosine function is even.

3. $\mathrm{Im}[\hat{x}_k]$ is odd since the sine function is odd.
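These properties can be verified on a short real sequence. The sketch below again assumes the $1/\sqrt{N}$, positive-exponent convention (computed here with `np.fft.ifft(..., norm="ortho")`); it checks that the $k = 0$ and $k = N/2$ coefficients are real and rebuilds the full spectrum from the first $N/2 + 1$ coefficients using the conjugate symmetry.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 16                                  # even length
x = rng.standard_normal(N)              # real test signal

X = np.fft.ifft(x, norm="ortho")        # DFT under the assumed convention

# Only k = 0 .. N/2 need to be stored for a real signal.
half = X[: N // 2 + 1]

# Rebuild k = N/2+1 .. N-1 from the conjugates of k = N/2-1 .. 1.
full = np.empty(N, dtype=complex)
full[: N // 2 + 1] = half
full[N // 2 + 1 :] = np.conj(half[1 : N // 2][::-1])

print(abs(X[0].imag), abs(X[N // 2].imag))   # both are zero up to rounding
print(np.allclose(full, X))                  # symmetry reproduces the full spectrum
```

This is the same saving that real-input FFT routines exploit when they return only the non-redundant half of the spectrum.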

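3.9.3 Statistical Properties Of The DFT Series Of Random Variables. Let $\mathbf{X} = (X_0, X_1, X_2, \ldots, X_{N-1})$ be a normal vector where $N$ is an even number and each element $X_m$ is an independent normal random variable such that for $m = 0, 1, 2, \ldots, N - 1$

$$X_m \sim N(\mu_m, \sigma^2).$$

The mean of the vector $\mathbf{X}$ is defined as

$$\boldsymbol{\mu} = (\mu_0, \mu_1, \mu_2, \ldots, \mu_{N-1}). \tag{3.149}$$

Using equation 3.139, the discrete Fourier transform of $\mathbf{X}$ is as follows

$$\hat{X}_k = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} X_m\, e^{i\left(\frac{2\pi}{N}\right)km}. \tag{3.150}$$

Similarly, the discrete Fourier transform of $\boldsymbol{\mu}$ is as follows

$$\hat{\mu}_k = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} \mu_m\, e^{i\left(\frac{2\pi}{N}\right)km}, \tag{3.151}$$

where $0 \le k \le N - 1$.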

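Since equation 3.150 represents a linear combination of independent normal random variables, $\hat{X}_k$ is also an independent normal complex random variable. Using the results from the complex analysis section, we have the following statistical properties of the DFT complex random variable $\hat{X}_k$ (12).

a. Mean of the complex variable $\hat{X}_k$:

$$\begin{aligned}
E[\hat{X}_k] &= E\big[\mathrm{Re}[\hat{X}_k] + i\, \mathrm{Im}[\hat{X}_k]\big] \\
&= E\big[\mathrm{Re}[\hat{X}_k]\big] + i\, E\big[\mathrm{Im}[\hat{X}_k]\big].
\end{aligned} \tag{3.152}$$

1. Using equation 3.143, the expected value of the real part is:

$$\begin{aligned}
E\big[\mathrm{Re}[\hat{X}_k]\big] &= E\Big[\frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} X_m \cos\Big[\Big(\frac{2\pi}{N}\Big)km\Big]\Big] \\
&= \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} \mu_m \cos\Big[\Big(\frac{2\pi}{N}\Big)km\Big] \\
&= \mathrm{Re}[\hat{\mu}_k].
\end{aligned} \tag{3.153}$$

2. Using equation 3.144, the expected value of the imaginary part is:

$$\begin{aligned}
E\big[\mathrm{Im}[\hat{X}_k]\big] &= \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} \mu_m \sin\Big[\Big(\frac{2\pi}{N}\Big)km\Big] \\
&= \mathrm{Im}[\hat{\mu}_k].
\end{aligned} \tag{3.154}$$

We conclude then that

$$E[\hat{X}_k] = \hat{\mu}_k, \tag{3.155}$$

where $\hat{\mu}_k$ is the $k^{th}$ element of the DFT of $\boldsymbol{\mu}$ at the frequency $k$.

b. Variance of the complex variable $\hat{X}_k$:

$$\mathrm{Var}[\hat{X}_k] = \mathrm{Var}\big[\mathrm{Re}[\hat{X}_k]\big] + \mathrm{Var}\big[\mathrm{Im}[\hat{X}_k]\big]. \tag{3.156}$$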

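1. Variance of the real part:

$$\begin{aligned}
\mathrm{Var}\big[\mathrm{Re}[\hat{X}_k]\big] &= \mathrm{Var}\Big[\frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} X_m \cos\Big[\Big(\frac{2\pi}{N}\Big)km\Big]\Big] \\
&= \frac{1}{N} \sum_{m=0}^{N-1} \mathrm{Var}[X_m] \cos^2\Big[\Big(\frac{2\pi}{N}\Big)km\Big] \\
&= \frac{\sigma^2}{N} \sum_{m=0}^{N-1} \cos^2\Big[\Big(\frac{2\pi}{N}\Big)km\Big].
\end{aligned} \tag{3.157}$$

Using the following trigonometric identity:

$$\sum_{r=0}^{N-1} \cos(r\alpha) = \frac{1}{2} + \frac{\sin\big[(N - \frac{1}{2})\alpha\big]}{2 \sin\big(\frac{\alpha}{2}\big)}, \tag{3.158}$$

where $\alpha \ne 2q\pi$ for $q \in \mathbb{Z}$, and the fact that:

$$\cos^2[\theta] = \frac{\cos(2\theta) + 1}{2},$$

we can write:

$$\begin{aligned}
\sum_{r=0}^{N-1} \cos^2[r\theta] &= \sum_{r=0}^{N-1} \frac{\cos[2r\theta] + 1}{2} \\
&= \frac{N}{2} + \frac{1}{2} \sum_{r=0}^{N-1} \cos[2r\theta] \\
&= \frac{N}{2} + \frac{1}{2} \Big[\frac{1}{2} + \frac{\sin\big[(N - \frac{1}{2})2\theta\big]}{2 \sin[\theta]}\Big] \\
&= \frac{N}{2} + \frac{1}{4} \Big[1 + \frac{\sin\big[(N - \frac{1}{2})2\theta\big]}{\sin[\theta]}\Big],
\end{aligned} \tag{3.159}$$

provided that $\theta \ne q\pi$ for $q \in \mathbb{Z}$.

Going back to equation 3.157, we can use equation 3.159 with:

$$\theta = \frac{2\pi k}{N},$$

where $\theta \ne q\pi$ for $q \in \mathbb{Z}$ implies that $k \ne 0$ and $k \ne \frac{N}{2}$. The result is as follows

$$\begin{aligned}
\sum_{r=0}^{N-1} \cos^2[r\theta] &= \frac{N}{2} + \frac{1}{4} \Big[1 + \frac{\sin\big[(N - \frac{1}{2})2\theta\big]}{\sin[\theta]}\Big] \\
&= \frac{N}{2} + \frac{1}{4} \Big[1 + \frac{\sin\big[(N - \frac{1}{2})2(\frac{2\pi k}{N})\big]}{\sin\big[\frac{2\pi k}{N}\big]}\Big] \\
&= \frac{N}{2} + \frac{1}{4} \Big[1 + \frac{\sin\big[4\pi k - \frac{2\pi k}{N}\big]}{\sin\big[\frac{2\pi k}{N}\big]}\Big] \\
&= \frac{N}{2} + \frac{1}{4} \big[1 - 1\big] \\
&= \frac{N}{2}.
\end{aligned}$$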
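The variance of the real part of $\hat{X}_k$ is as follows.

1a. When $k \ne 0$ and $k \ne \frac{N}{2}$:

$$\begin{aligned}
\mathrm{Var}\big[\mathrm{Re}[\hat{X}_k]\big] &= \frac{\sigma^2}{N} \sum_{m=0}^{N-1} \cos^2\Big[\Big(\frac{2\pi}{N}\Big)km\Big] \\
&= \frac{\sigma^2}{N} \cdot \frac{N}{2} \\
&= \frac{\sigma^2}{2}.
\end{aligned} \tag{3.160}$$

1b. When $k = 0$:

$$\begin{aligned}
\mathrm{Var}\big[\mathrm{Re}[\hat{X}_0]\big] &= \frac{\sigma^2}{N} \sum_{m=0}^{N-1} \big[(1)^2\big] \\
&= \frac{\sigma^2}{N} N \\
&= \sigma^2.
\end{aligned} \tag{3.161}$$

1c. When $k = \frac{N}{2}$:

$$\begin{aligned}
\mathrm{Var}\big[\mathrm{Re}[\hat{X}_{N/2}]\big] &= \frac{\sigma^2}{N} \sum_{m=0}^{N-1} \big[(-1)^m\big]^2 \\
&= \sigma^2.
\end{aligned} \tag{3.162}$$

2. Variance of the imaginary part. Since $\sin^2 = 1 - \cos^2$, the same argument gives

$$\mathrm{Var}\big[\mathrm{Im}[\hat{X}_k]\big] = \frac{\sigma^2}{N} \sum_{m=0}^{N-1} \sin^2\Big[\Big(\frac{2\pi}{N}\Big)km\Big] = \begin{cases} \dfrac{\sigma^2}{2} & k \ne 0 \text{ and } k \ne \frac{N}{2} \\ 0 & k = 0 \text{ or } k = \frac{N}{2}. \end{cases} \tag{3.164}$$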
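The mean and variance results of this section can be checked by a small Monte Carlo experiment. The sketch below is illustrative only: it assumes the $1/\sqrt{N}$, positive-exponent DFT convention (computed with `np.fft.ifft(..., norm="ortho")`), and the values of `N`, `sigma`, and the mean vector `mu` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, sigma = 8, 100_000, 1.5
mu = np.linspace(-1.0, 1.0, N)                    # arbitrary mean vector

# Each row is one realization of X with X_m ~ N(mu_m, sigma^2), independent.
X = mu + sigma * rng.standard_normal((trials, N))
Xhat = np.fft.ifft(X, axis=1, norm="ortho")       # DFT of every realization

var_re = Xhat.real.var(axis=0)    # expected: sigma^2 at k = 0, N/2; sigma^2/2 elsewhere
var_im = Xhat.imag.var(axis=0)    # expected: 0 at k = 0, N/2; sigma^2/2 elsewhere
mean_hat = Xhat.mean(axis=0)      # expected: the DFT of mu
mu_hat = np.fft.ifft(mu, norm="ortho")

print(np.round(var_re, 3))
print(np.round(var_im, 3))
print(np.allclose(mean_hat, mu_hat, atol=0.03))
```

With $\sigma = 1.5$, the real-part variances should be near 2.25 for $k = 0$ and $k = N/2$ and near 1.125 for the other coefficients.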


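3.9.4 Summary Of The Statistics Of The DFT Of Random Variables. Given a normal vector $\mathbf{X} = (X_0, X_1, X_2, \ldots, X_{N-1})$ and its mean vector $\boldsymbol{\mu} = (\mu_0, \mu_1, \mu_2, \ldots, \mu_{N-1})$, where $N$ is an even number and each element $X_m$ is a real independent normal random variable such that for $m = 0, 1, 2, \ldots, N - 1$

$$X_m \sim N(\mu_m, \sigma^2),$$

the elements of the DFT of $\mathbf{X}$ have the following distributions. Define the mean, $\hat{\mu}_k$, of the complex coefficient $\hat{X}_k$ by

$$\hat{\mu}_k = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} \mu_m\, e^{i\left(\frac{2\pi}{N}\right)km}, \tag{3.165}$$

where $\mu_m$ is the mean of the independent normal random variable $X_m$. We have

a. $k \ne 0$ and $k \ne \frac{N}{2}$:

$$\mathrm{Re}[\hat{X}_k] \sim N\Big(\mathrm{Re}[\hat{\mu}_k], \frac{\sigma^2}{2}\Big) \tag{3.166}$$

$$\mathrm{Im}[\hat{X}_k] \sim N\Big(\mathrm{Im}[\hat{\mu}_k], \frac{\sigma^2}{2}\Big) \tag{3.167}$$

b. $k = 0$ or $k = \frac{N}{2}$:

$$\mathrm{Re}[\hat{X}_k] \sim N\big(\mathrm{Re}[\hat{\mu}_k], \sigma^2\big) \tag{3.168}$$

$$\mathrm{Im}[\hat{X}_k] = 0. \tag{3.169}$$

IV. Speech De-noising Systems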


4.1  Introduction 

In this chapter, we present several speech de-noising systems (SDS) using Stein's criteria, wavelets, Fourier, and both the soft thresholding technique (STT) and the hard thresholding technique (HTT). We begin this chapter with an overview of our speech de-noising algorithm and a summary of the main characteristics of speech (voiced, unvoiced, silent, pitch, and formant frequencies), and finally, we present our SDSs.

The speech de-noising systems we developed are applied to noisy voiced portions only. The unvoiced and silent speech portions are processed using a multiplication ratio based on the results of de-noising the voiced portions. Some of our SDSs use the noisy phase in order to eliminate the phase distortions caused by the non-linear processing of the STT and HTT thresholding techniques.

4.2  Speech  De-noising  Systems  Using  The  SURE  Criteria 

We present several techniques that are based on using the SURE criteria described in Chapter 3. These techniques assume the following restrictions:

1. A clean speech signal has additive white Gaussian noise which has a normal distribution with zero-mean and variance of $\sigma^2$.

2.  Only  voiced  speech  is  subjected  to  the  de-noising  process. 

3.  Unvoiced  speech  and  the  silent  portions  are  not  subjected  to  the  de-noising  process, 
instead,  they  are  adjusted  by  an  energy-related  ratio  to  be  defined  later. 

4.  The  location  of  the  voiced,  unvoiced,  and  silent  portions  of  speech  are  assumed  to  be 
known. 

5. The variance required by the SURE function is calculated using an estimate from the silent portions of the speech.

4.2.1 Characteristics Of Speech. In order to understand how human speech is produced, we are obliged to study and characterize the vocal organs responsible for its production. The vocal organs work by using compressed air which is supplied by the lungs through the trachea (19). The compressed air can then be subjected to periodic pulses (excitations) by the vocal cords (the glottis). The repetition rate of these pulses is termed pitch and the resulting periodic speech is termed voiced. When the compressed air passing through the vocal cords is not periodically excited and is forced to pass through a small opening, air turbulence occurs and a wide-band, noise-like sound is generated. This speech sound is termed unvoiced. After passing through the glottal output, the speech sound, voiced or unvoiced, is subjected to a filtering operation by the shape of the vocal tract. This organ acts as an acoustical tube which strongly passes some natural frequencies which are termed formants.

We conclude then that speech is a signal that is mainly composed of voiced and unvoiced sounds. Voiced speech is characterized by a periodic behavior where the fundamental frequency, or pitch frequency, may range from 30Hz to about 600Hz (19). The pitch varies between males and females. Normally, the pitch frequency is about 125Hz. In our future discussions, we will assume a typical pitch frequency of 125Hz. On the other hand, unvoiced speech has virtually no periodicity and behaves like wide-band noise with less energy than voiced speech. If a speech signal is clean, the energy of the periodic voiced portions is concentrated in bands of frequencies which are harmonics of the fundamental frequency. The pitch frequency and the first, second, and third formant frequencies are normally located below 3kHz. The energy of the unvoiced portions has a broad-band energy distribution similar to that of noise.

4.2.2 De-noising Algorithm. We developed a speech de-noising algorithm having the features described below.

1. The user inputs the following parameters:

a. The noisy speech file name and the number of samples in this file.

b. The file containing the characteristics of each speech segment: start sample number, end sample number, and status (i.e., voiced, unvoiced, or silent).

c. The number of overlap points between adjacent segments.

d. The percent, $p$, of the energy of the unvoiced and silent portions to keep.

e. The domain where the de-noising is to take place: time, Fourier (Real and Imaginary), Fourier (Real and Imaginary) to be constructed using noisy phase, wavelets, or any combination of the last four domains.

f. If the wavelets are not involved in the process, the user chooses between using soft or hard thresholding.

g. If the user chooses the wavelet domain, the following parameters are also requested:

i. The wavelet filter and the number of filter points.

ii. The number of decomposition levels.

iii. The thresholding method for the details: soft or hard thresholding.

iv. The de-noising process for the approximation coefficients. The choices include: soft or hard thresholding, no change to the approximations, or energy reduction of the approximations by the same amount as the energy change, $R_d$, of the processed details.

2. The program searches for the first silent portion and estimates the variance (see equation 4.2).

3. Using the input information from part 1 and the variance from part 2, the program searches for the first voiced portion, multiplies it by a window function using the overlap specified by the user (see equation 4.3), and applies the de-noising process specified by the user.

4. The program calculates the energy ratio between the de-noised voiced portion and the noisy voiced portion.

5. After initializing the variance obtained by step 2 and the energy ratio obtained by step 4, the program steps through the segments file starting from the beginning as follows:

a. Read the speech segment and multiply it by a window function using the overlap specified by the user.

b. If the segment is silent:

i. Update the variance.

ii. Multiply this segment by the energy ratio $R_v$ and the percent choice $p$.

c. If the segment is unvoiced, multiply this segment by the energy ratio $R_v$ and the percent choice $p$.

d. If the segment is voiced:

i. Apply the de-noising process specified by the user.

ii. Update the energy ratio $R_v$.
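The control flow of steps 2 through 5 can be sketched as follows. This is a simplified illustration, not the program described above: windowing and overlap are omitted, segments are assumed to be given as `(start, end, status)` tuples with an exclusive `end`, and `denoise_speech`, `denoise_voiced`, and `toy_soft` are hypothetical names. The de-noising routine is passed in as a function so that any of the domains listed in step 1e could be substituted.

```python
import numpy as np

def energy(x):
    """Sum of squared samples."""
    return float(np.sum(np.square(x)))

def denoise_speech(signal, segments, denoise_voiced, p=0.5):
    """Simplified segment loop (no windowing or overlap).

    segments       : list of (start, end, status), status in
                     {"voiced", "unvoiced", "silent"}, end exclusive
    denoise_voiced : hypothetical callable (frame, variance) -> de-noised frame
    p              : fraction of unvoiced/silent energy scaling to keep
    """
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()

    # Step 2: initial variance from the first silent segment.
    s, e, _ = next(seg for seg in segments if seg[2] == "silent")
    variance = float(np.var(signal[s:e]))

    # Steps 3-4: initial energy ratio from the first voiced segment.
    s, e, _ = next(seg for seg in segments if seg[2] == "voiced")
    ratio = energy(denoise_voiced(signal[s:e], variance)) / energy(signal[s:e])

    # Step 5: walk through all segments from the beginning.
    for s, e, status in segments:
        frame = signal[s:e]
        if status == "silent":
            variance = float(np.var(frame))      # update the noise variance
            out[s:e] = ratio * p * frame
        elif status == "unvoiced":
            out[s:e] = ratio * p * frame
        else:                                    # voiced
            cleaned = denoise_voiced(frame, variance)
            if energy(frame) > 0.0:
                ratio = energy(cleaned) / energy(frame)   # update R_v
            out[s:e] = cleaned
    return out

def toy_soft(frame, variance):
    """Placeholder de-noiser: soft threshold at one noise standard deviation."""
    t = np.sqrt(variance)
    return np.sign(frame) * np.maximum(np.abs(frame) - t, 0.0)

sig = np.array([0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0, 0.2, -0.2, 0.2, -0.2])
segs = [(0, 4, "silent"), (4, 8, "voiced"), (8, 12, "unvoiced")]
out = denoise_speech(sig, segs, toy_soft, p=0.5)
print(out)
```

In the toy run the voiced samples shrink from magnitude 2.0 to 1.9, giving an energy ratio of 0.9025, so the silent and unvoiced samples are scaled by 0.9025 times 0.5.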


4.2.3 Variance Estimation And The Window Function. The use of the SURE function (see equation 3.28) requires the knowledge of the variance $\sigma^2$. Generally, when processing noisy speech data, we don't know in advance the value of this variance. One way of estimating this variance is to detect the silent portions of the speech and then use the statistics of white Gaussian noise in order to estimate the variance.

Given a silent noisy speech portion, $\mathbf{X} = (X_0, X_1, X_2, \ldots, X_{N-1})$, we estimated the variance using the following consistent estimators as described in (12):

$$\bar{X} = \frac{\sum_{i=0}^{N-1} X_i}{N}, \tag{4.1}$$

the estimate of the variance $\sigma^2$ is given by

$$\hat{\sigma}^2 = \frac{\sum_{i=0}^{N-1} (X_i - \bar{X})^2}{N}. \tag{4.2}$$
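A minimal sketch of the two estimators, assuming the silent portion is available as an array of samples (the function name `estimate_noise_variance` is illustrative):

```python
import numpy as np

def estimate_noise_variance(silent):
    """Sample mean (eq. 4.1) and variance with divisor N (eq. 4.2)."""
    x = np.asarray(silent, dtype=float)
    mean = x.sum() / x.size
    return float(np.sum((x - mean) ** 2) / x.size)

print(estimate_noise_variance([1.0, 2.0, 3.0, 4.0]))
```

The divisor is $N$ rather than $N - 1$, which matches NumPy's default `np.var` behavior.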


We mentioned earlier that before processing any speech segment, we multiply it by a window function. In speech processing, it is important to window speech data before processing it. The reason for using windows is to analyze a finite segment at a time. The length of the window may vary depending on the desired properties of the signal under analysis (e.g., pitch, time resolution, frequency resolution). However, both the type and the filtering characteristics of the window function play an important role in the results of the analysis. Ideally, we would like a window whose Fourier transform does not have any side-lobe peaks. In practice, we use many different windows, such as the Bartlett window, the Hamming window, and the Hanning window.

Since parts of our algorithm use the discrete wavelet transform (DWT), which is implemented using a periodic extension of the signal under analysis, we chose to implement our window using smooth functions. The trigonometric functions, sines and cosines, are good examples of smooth functions. Our window is implemented as follows:

$$win(k) = \begin{cases} \sin^2(\cdots) & t_b - \frac{\delta}{2} \le k < t_b + \frac{\delta}{2} \\ 1 & t_b + \frac{\delta}{2} \le k < t_e - \frac{\delta}{2} \\ 1 - \sin^2(\cdots) & t_e - \frac{\delta}{2} \le k < t_e + \frac{\delta}{2} \\ 0 & \text{elsewhere,} \end{cases} \tag{4.3}$$

where $\delta$ is the overlap between adjacent windows (all our speech experiments have an overlap of $\delta = 16$). The overlap between three adjacent windows is illustrated in figure 4.1. Figure 4.2 illustrates the window described by equation 4.3 and its Fourier transform. Observe that the time domain function has smooth transitions at both ends in order to avoid the introduction of sudden discontinuities caused by a purely rectangular window.
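A window of this shape can be sketched as below. The arguments of the sine-squared ramps are an assumption made for illustration: the ramp is taken to rise from 0 to 1 over the $\delta$ samples centred on the segment start $t_b$ (a quarter period of the sine), with the mirror-image fall centred on the segment end $t_e$. The names `speech_window`, `t_b`, `t_e`, and `delta` are likewise illustrative. With this choice, the falling edge of one window and the rising edge of the next add up to one.

```python
import numpy as np

def speech_window(k, t_b, t_e, delta):
    """Flat-top window with sine-squared transitions (assumed ramp argument)."""
    k = np.asarray(k, dtype=float)
    w = np.zeros_like(k)
    rise = (k >= t_b - delta / 2) & (k < t_b + delta / 2)
    flat = (k >= t_b + delta / 2) & (k < t_e - delta / 2)
    fall = (k >= t_e - delta / 2) & (k < t_e + delta / 2)
    # Assumed phase: 0 at the start of the transition, pi/2 at its end.
    w[rise] = np.sin(0.5 * np.pi * (k[rise] - (t_b - delta / 2)) / delta) ** 2
    w[flat] = 1.0
    w[fall] = 1.0 - np.sin(0.5 * np.pi * (k[fall] - (t_e - delta / 2)) / delta) ** 2
    return w

k = np.arange(300)
w1 = speech_window(k, 32, 128, 16)     # first segment
w2 = speech_window(k, 128, 224, 16)    # adjacent segment starting where the first ends
total = w1 + w2

print(total[40:216].min(), total[40:216].max())   # both equal 1 across the overlap
```

The unit sum means that adding overlapping windowed segments back together does not change the signal level in the transition regions.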


Figure 4.2 Speech window and its Fourier transform.

4.2.4 De-noising The Unvoiced And Silent Portions Of Speech. The unvoiced and silent portions of noisy speech have characteristics that are similar to the characteristics of noise. Since the SURE function treats them as white Gaussian noise and tries to eliminate these portions, we decided to de-noise only the voiced portions (see figures D.1 through D.5 for white Gaussian noise and figures D.6 through D.15 for unvoiced speech). The speech without silent and unvoiced portions tends to sound distorted and is hard to understand. For these reasons, we choose not to process the silent and unvoiced portions; instead, we multiply them by the percent ($p = 50\%$) and the energy ratio ($R_v$) of the change of the energy of the voiced portions as

$$R_v = \frac{E_t}{E_n}, \tag{4.4}$$

where

$$E_n = \sum_{n=0}^{N-1} x_n^2 \tag{4.5}$$

and

$$E_t = \sum_{n=0}^{N-1} (x_n^t)^2, \tag{4.6}$$

where $x_n$ and $x_n^t$ are the noisy and thresholded (STT or HTT) voiced speech samples, respectively. Since the noisy samples are thresholded, we have $E_t \le E_n$. The voiced ratio is then constrained as

$$0 \le R_v \le 1. \tag{4.7}$$

The new silent and unvoiced samples are then defined as

$$x_n^{s,reduced} = p\, R_v\, x_n^{s,noisy} \tag{4.8}$$

$$x_n^{u,reduced} = p\, R_v\, x_n^{u,noisy} \tag{4.9}$$

where $x_n^{s,noisy}$, $x_n^{u,noisy}$, $x_n^{s,reduced}$, and $x_n^{u,reduced}$ are the silent noisy samples, unvoiced noisy samples, the silent reduced samples, and unvoiced reduced samples, respectively. The ratio $R_v$ helps balance the energy between the voiced, unvoiced, and silent portions, as well as reduce the power
of the noise in the silent and unvoiced portions.

4.2.5 De-noising in the time domain. Given a noisy voiced speech signal $\mathbf{X} = (X_0, X_1, X_2, \ldots, X_{N-1})$ such that

$$\mathbf{X} = \mathbf{S} + \mathbf{Z}, \tag{4.10}$$

where $\mathbf{S} = (s_0, s_1, s_2, \ldots, s_{N-1})$ is a clean speech vector and $\mathbf{Z} = (Z_0, Z_1, Z_2, \ldots, Z_{N-1})$ is a white Gaussian noise vector such that for $m = 0, 1, 2, \ldots, N - 1$

$$Z_m \sim N(0, \sigma^2),$$

the expected value, $\boldsymbol{\mu}$, of the clean speech data $\mathbf{S}$ is given by

$$E[\mathbf{S}] = \boldsymbol{\mu},$$

where $\boldsymbol{\mu} = (\mu_0, \mu_1, \mu_2, \ldots, \mu_{N-1})$.

The noisy vector $\mathbf{X}$, which is formed by the sum of the constant vector $\mathbf{S}$ and the normal vector $\mathbf{Z}$, has a normal distribution with mean $\boldsymbol{\mu}$, such that

$$X_m \sim N(\mu_m, \sigma^2), \tag{4.11}$$

where $m = 0, 1, 2, \ldots, N - 1$.

Since $\mathbf{X}$ has a normal distribution, we can directly use the time domain speech data degraded by white Gaussian noise with the SURE function (see figure 4.3). The time domain speech de-noising system (SDS) has the advantage of not requiring further transformations, which are time consuming. However, the application of either the soft or the hard thresholding technique to a segment of speech in the time domain uses a single threshold to adjust a whole window of speech. This threshold may not be sufficient to eliminate most of the noise, and hence we may expect the output of the time SDS to be only slightly cleaner than the input speech.
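To make the single-threshold behavior concrete, the sketch below selects one soft threshold for a whole frame by minimizing a SURE risk estimate. The risk expression used here is the standard form for soft thresholding with known noise variance, $N\sigma^2 - 2\sigma^2\,\#\{i : |x_i| \le t\} + \sum_i \min(|x_i|, t)^2$; it is assumed, not confirmed, to correspond to equation 3.28. The function names are illustrative.

```python
import numpy as np

def soft_threshold(x, t):
    """Pull each sample towards zero by t; samples with |x| <= t become zero."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sure_soft(x, t, sigma2):
    """SURE risk estimate for soft thresholding at t (standard form, assumed)."""
    ax = np.abs(np.asarray(x, dtype=float))
    return (ax.size * sigma2
            - 2.0 * sigma2 * np.count_nonzero(ax <= t)
            + np.sum(np.minimum(ax, t) ** 2))

def sure_threshold(x, sigma2):
    """Search the candidate thresholds {0} and {|x_i|} for the smallest risk."""
    x = np.asarray(x, dtype=float)
    candidates = np.concatenate(([0.0], np.abs(x)))
    risks = np.array([sure_soft(x, t, sigma2) for t in candidates])
    return float(candidates[np.argmin(risks)])

rng = np.random.default_rng(5)
sigma = 0.5
clean = np.zeros(256)
clean[::32] = 4.0                                    # a few large samples
noisy = clean + sigma * rng.standard_normal(256)

t = sure_threshold(noisy, sigma ** 2)
cleaned = soft_threshold(noisy, t)
print(t, np.mean((noisy - clean) ** 2), np.mean((cleaned - clean) ** 2))
```

The toy signal is deliberately sparse, which favors thresholding; a windowed speech frame in the time domain is not sparse, which is one way to see why a single time-domain threshold removes only part of the noise.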


Figure 4.3 Speech de-noising in the time domain. The block diagram feeds the noisy speech into the speech de-noising system (SDS), which outputs the processed speech.


4.2.6 De-noising in the time domain using the noisy phase. In the speech processing field, it is believed that some of the distortion caused by de-noising speech data is mainly due to the change of the phase in the Fourier representation of speech. These distortions may diminish the intelligibility of the de-noised speech. In order to study the effect of the phase, $\theta$, on our SDS and on the intelligibility of the de-noised speech, we save the noisy phase for reconstruction and apply the de-noising techniques described in the previous section (see figure 4.4).

Although this technique improves intelligibility, it requires more processing due to the Fourier transform and more storage due to phase saving.

Figure 4.4 Speech de-noising in the time domain using noisy phase

4.2.7 De-noising in the frequency domain. We have seen that if the real and imaginary parts of a complex random variable are normal, the amplitude and phase can't be normal. Since the Fourier transform is a linear operation, the Fourier transform of a normal multivariate vector is also normal. However, the variances of the Fourier transform coefficients were shown to be not identical (e.g., dc component). Recall the discrete Fourier transform (DFT) of a periodic finite-length sequence of $N$ points, $\{X_m\}_{m=0}^{N-1}$:

$$\hat{X}_k = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} X_m\, e^{i\left(\frac{2\pi}{N}\right)km}, \tag{4.12}$$

where $0 \le k \le N - 1$. We have shown that if the time sequence $\{X_m\}_{m=0}^{N-1}$ has a normal distribution such that each random variable $X_m \sim N(\mu_m, \sigma^2)$, the Fourier sequence $\{\hat{X}_k\}_{k=0}^{N-1}$ is conjugate symmetric such that the real and imaginary parts of $\hat{X}_k$ have the normal distributions $N(\mathrm{Re}[\hat{\mu}_k], \frac{\sigma^2}{2})$ and $N(\mathrm{Im}[\hat{\mu}_k], \frac{\sigma^2}{2})$ for $1 \le k < \frac{N}{2}$, respectively. However, the $0^{th}$ and the $\frac{N}{2}^{th}$ real and imaginary elements are distributed according to $N(\mathrm{Re}[\hat{\mu}_k], \sigma^2)$ and $N(\mathrm{Im}[\hat{\mu}_k], 0)$, respectively. This property of the Fourier coefficients allows us to use the sequence of real and imaginary elements 1 through $(\frac{N}{2} - 1)$, inclusive, with the SURE function, which requires the input random variables to be normal, independent, and to have the same variance.

The method calls for processing separately the two sequences $\{\mathrm{Re}[\hat{X}_k]\}_{k=1}^{N/2-1}$ and $\{\mathrm{Im}[\hat{X}_k]\}_{k=1}^{N/2-1}$, where each element has a normal distribution with variance $\frac{\sigma^2}{2}$ (see figure 4.5). After the application of the SURE threshold, the real and imaginary outputs are combined with the original dc component and the $\frac{N}{2}$ component and then inverse Fourier transformed to produce the time domain de-noised signal. The elements $\{\hat{X}_0, \hat{X}_{N/2}\}$ are left untouched because of their unique distributions and characteristics (see equations 3.168 and 3.169). Depending on how the DFT is defined, the dc component, $\hat{X}_0$, is a measure of the mean of the time sequence $\{X_m\}_{m=0}^{N-1}$, while $\hat{X}_{N/2}$ is the high frequency component. Since noise is generally composed of high frequencies, little or no modification to the dc component may occur.
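The processing chain just described can be sketched as follows. This is an illustration under stated assumptions: the thresholds `t_re` and `t_im` are passed in directly instead of being chosen by the SURE function, the function names are hypothetical, and NumPy's real-input FFT with `norm="ortho"` is used. That routine has the opposite exponent sign to equation 4.12, which for a real frame only conjugates the coefficients and does not affect sign-symmetric thresholding.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft thresholding of real values."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def denoise_frequency_domain(frame, t_re, t_im):
    """Threshold Re and Im of bins 1 .. N/2-1; keep bins 0 and N/2 untouched."""
    x = np.asarray(frame, dtype=float)
    N = x.size
    if N % 2:
        raise ValueError("frame length N must be even")
    X = np.fft.rfft(x, norm="ortho")       # bins 0 .. N/2 of a real frame
    inner = slice(1, N // 2)               # bins with variance sigma^2/2 per part
    X_new = X.copy()
    X_new[inner] = (soft_threshold(X.real[inner], t_re)
                    + 1j * soft_threshold(X.imag[inner], t_im))
    # irfft rebuilds the conjugate-symmetric half and returns a real frame.
    return np.fft.irfft(X_new, n=N, norm="ortho")

rng = np.random.default_rng(6)
frame = np.sin(2 * np.pi * 5 * np.arange(64) / 64) + 0.3 * rng.standard_normal(64)
out = denoise_frequency_domain(frame, 0.3, 0.3)
print(np.sum(frame ** 2), np.sum(out ** 2))   # energy does not increase
```

Because the dc bin is left alone, the mean of the frame is preserved exactly, and because thresholding only shrinks coefficients, the output energy cannot exceed the input energy.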


Figure  4.5  Speech  de-noising  in  the  frequency  domain 


4.2.7.1 Soft Thresholding Of Complex Data. When using the shrinkage or soft thresholding technique (STT) in the Fourier domain, the real and imaginary parts are modified in a way that affects the phase of the complex Fourier coefficients being de-noised. Consider the Fourier coefficient $\hat{X}_k$, where $k = 1, 2, \ldots, (\frac{N}{2} - 1)$, and denote the real and imaginary soft thresholds by $t^{Re}$ and $t^{Im}$, respectively. Because of the definition of the STT, which pulls a noisy data sample towards zero if its magnitude is greater than the threshold or sets it to zero otherwise, we have four different cases (see figure 4.6).

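Figure 4.6 Four possible changes and orientations of a de-noised complex number using the STT.

Define the new modified complex number, $X_k^{t,soft}$, by

$$X_k^{t,soft} = \mathrm{Re}[X_k^{t,soft}] + i\,\mathrm{Im}[X_k^{t,soft}], \tag{4.13}$$

where the de-noised real part is defined as (see equation 3.32)

$$\mathrm{Re}[X_k^{t,soft}] = \mathrm{Re}[X_k] - \min\left(|\mathrm{Re}[X_k]|, t_s^{Re}\right)\mathrm{sgn}\left(\mathrm{Re}[X_k]\right), \tag{4.14}$$

and the imaginary part is defined as

$$\mathrm{Im}[X_k^{t,soft}] = \mathrm{Im}[X_k] - \min\left(|\mathrm{Im}[X_k]|, t_s^{Im}\right)\mathrm{sgn}\left(\mathrm{Im}[X_k]\right). \tag{4.15}$$

Combining equations 4.14 and 4.15 we get

$$\begin{aligned}
X_k^{t,soft} &= \mathrm{Re}[X_k^{t,soft}] + i\,\mathrm{Im}[X_k^{t,soft}] \\
&= \mathrm{Re}[X_k] - \min\left(|\mathrm{Re}[X_k]|, t_s^{Re}\right)\mathrm{sgn}\left(\mathrm{Re}[X_k]\right) + i\left(\mathrm{Im}[X_k] - \min\left(|\mathrm{Im}[X_k]|, t_s^{Im}\right)\mathrm{sgn}\left(\mathrm{Im}[X_k]\right)\right) \\
&= \left(\mathrm{Re}[X_k] + i\,\mathrm{Im}[X_k]\right) - \left(\min\left(|\mathrm{Re}[X_k]|, t_s^{Re}\right)\mathrm{sgn}\left(\mathrm{Re}[X_k]\right) + i\,\min\left(|\mathrm{Im}[X_k]|, t_s^{Im}\right)\mathrm{sgn}\left(\mathrm{Im}[X_k]\right)\right) \\
&= X_k + g^{t,soft}[X_k],
\end{aligned} \tag{4.16}$$

where

$$g^{t,soft}[X_k] = -\min\left(|\mathrm{Re}[X_k]|, t_s^{Re}\right)\mathrm{sgn}\left(\mathrm{Re}[X_k]\right) - i\,\min\left(|\mathrm{Im}[X_k]|, t_s^{Im}\right)\mathrm{sgn}\left(\mathrm{Im}[X_k]\right). \tag{4.17}$$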


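The above function $g^{t,soft}$ is the complex equivalent of the real function defined by equation 3.31. The phase (provided it exists) of the de-noised complex coefficient, $X_k^{t,soft}$, is defined as

$$\theta_k^{t,soft} = \arctan\left[\frac{\mathrm{Im}[X_k^{t,soft}]}{\mathrm{Re}[X_k^{t,soft}]}\right] = \arctan\left[\frac{\mathrm{Im}[X_k] - \min\left(|\mathrm{Im}[X_k]|, t_s^{Im}\right)\mathrm{sgn}\left(\mathrm{Im}[X_k]\right)}{\mathrm{Re}[X_k] - \min\left(|\mathrm{Re}[X_k]|, t_s^{Re}\right)\mathrm{sgn}\left(\mathrm{Re}[X_k]\right)}\right]. \tag{4.18}$$

On the other hand, the phase $\theta_k$ of the noisy coefficient $X_k$ is defined as

$$\theta_k = \arctan\left[\frac{\mathrm{Im}[X_k]}{\mathrm{Re}[X_k]}\right]. \tag{4.19}$$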


We see from equations 4.18 and 4.19 that this new shrinkage technique, applied to the real and imaginary parts separately, has the potential to introduce a lot of distortion due to the phase changes of the entire frequency spectrum. In fact, when the thresholds act on the real and imaginary parts, the phase can take any value within its domain (see figure 4.6). One way of avoiding more phase distortion than is present in the noisy signal is to keep the original noisy phase and use it in the inverse Fourier transform back to the time domain.
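The phase change described above is easy to reproduce numerically. The following is a minimal sketch, assuming hand-picked thresholds (not SURE thresholds) and hypothetical function names; it applies the real soft threshold of equations 4.14 and 4.15 to each part of one coefficient and compares the phases. Note that `cmath.phase` uses the four-quadrant `atan2`, which agrees with the arctangent of equation 4.19 for coefficients with a positive real part.

```python
import cmath
import math


def soft_threshold(x: float, t: float) -> float:
    """Real soft threshold: x - min(|x|, t) * sgn(x)."""
    return x - min(abs(x), t) * math.copysign(1.0, x)


def soft_threshold_complex(xk: complex, t_re: float, t_im: float) -> complex:
    """Apply the real soft threshold separately to Re[xk] and Im[xk]."""
    return complex(soft_threshold(xk.real, t_re), soft_threshold(xk.imag, t_im))


xk = complex(3.0, 1.0)
yk = soft_threshold_complex(xk, t_re=0.5, t_im=0.5)

# The two parts shrink by the same absolute amount, so their ratio (and the phase) changes.
print("noisy phase    :", cmath.phase(xk))
print("de-noised phase:", cmath.phase(yk))
```

Because both parts are reduced by up to a fixed amount rather than by a common factor, the smaller part loses proportionally more and the vector rotates towards the axis of the larger part.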
4.2.7.2 Hard Thresholding Of Complex Data. When using the Hard Thresholding Technique (HTT) in the Fourier domain, the real and imaginary parts are also modified in a way that changes the phase of the complex Fourier coefficients being de-noised. Consider the Fourier coefficient $X_k$ where $k = 1, 2, \ldots, (N/2 - 1)$, and denote the real and imaginary hard thresholds by $t_h^{Re}$ and $t_h^{Im}$, respectively. Because of the definition of the HTT, which sets a noisy data sample to zero if its magnitude is less than the threshold, we have four different cases (see figure 4.7).


Figure 4.7 Four possible changes and orientations of a de-noised complex number using the HTT.


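Define the new modified complex number, $X_k^{t,hard}$, by

$$X_k^{t,hard} = \mathrm{Re}[X_k^{t,hard}] + i\,\mathrm{Im}[X_k^{t,hard}], \tag{4.20}$$

where the de-noised real part is defined as (see equation 3.45)

$$\mathrm{Re}[X_k^{t,hard}] = \mathrm{Re}[X_k]\,\chi_{[-t_h^{Re},\,t_h^{Re}]^c}\left(\mathrm{Re}[X_k]\right), \tag{4.21}$$

and the imaginary part is defined as

$$\mathrm{Im}[X_k^{t,hard}] = \mathrm{Im}[X_k]\,\chi_{[-t_h^{Im},\,t_h^{Im}]^c}\left(\mathrm{Im}[X_k]\right). \tag{4.22}$$

Here $\chi_A$ denotes the indicator function of the set $A$, and $[-t, t]^c$ is the complement of the interval $[-t, t]$, so that a sample is kept only if its magnitude exceeds the threshold.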
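Combining equations 4.21 and 4.22 we get

$$X_k^{t,hard} = \mathrm{Re}[X_k]\,\chi_{[-t_h^{Re},\,t_h^{Re}]^c}\left(\mathrm{Re}[X_k]\right) + i\,\mathrm{Im}[X_k]\,\chi_{[-t_h^{Im},\,t_h^{Im}]^c}\left(\mathrm{Im}[X_k]\right) = X_k + g^{t,hard}[X_k], \tag{4.23}$$

where

$$g^{t,hard}[X_k] = -\mathrm{Re}[X_k]\,\chi_{[-t_h^{Re},\,t_h^{Re}]}\left(\mathrm{Re}[X_k]\right) - i\,\mathrm{Im}[X_k]\,\chi_{[-t_h^{Im},\,t_h^{Im}]}\left(\mathrm{Im}[X_k]\right). \tag{4.24}$$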


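The above $g^{t,hard}$ function is the complex equivalent to the real $g^{hard}(X)$ function defined by equation 3.46. The phase (provided it exists) of the de-noised complex coefficient, $X_k^{t,hard}$, is defined as

$$\begin{aligned}
\theta_k^{t,hard} &= \arctan\left[\frac{\mathrm{Im}[X_k^{t,hard}]}{\mathrm{Re}[X_k^{t,hard}]}\right] \\
&= \arctan\left[\frac{\mathrm{Im}[X_k]\,\chi_{[-t_h^{Im},\,t_h^{Im}]^c}\left(\mathrm{Im}[X_k]\right)}{\mathrm{Re}[X_k]\,\chi_{[-t_h^{Re},\,t_h^{Re}]^c}\left(\mathrm{Re}[X_k]\right)}\right] \\
&= \arctan\left[\tan[\theta_k]\,\frac{\chi_{[-t_h^{Im},\,t_h^{Im}]^c}\left(\mathrm{Im}[X_k]\right)}{\chi_{[-t_h^{Re},\,t_h^{Re}]^c}\left(\mathrm{Re}[X_k]\right)}\right],
\end{aligned} \tag{4.25}$$

where the phase $\theta_k$ of the noisy coefficient $X_k$ is defined as

$$\theta_k = \arctan\left[\frac{\mathrm{Im}[X_k]}{\mathrm{Re}[X_k]}\right]. \tag{4.26}$$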
We see from equations 4.25 and 4.26 that when the HTT is applied to the real and imaginary parts separately, it has the potential to introduce a lot of distortion due to the phase changes of the entire frequency spectrum. However, these phase distortions can take only four different forms:

1. Do not change the phase.

2. Set the phase to zero.

3. Set the phase to $\pi/2$.

4. Set the noisy data to zero, changing the phase from defined to undefined.
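The four outcomes listed above can be checked with a short Python sketch. The thresholds, sample coefficients and function names are illustrative assumptions, not values from the experiments. One caveat: `cmath.phase` is the four-quadrant `atan2`, so for coefficients with a negative real or imaginary part it would report $\pi$ or $-\pi/2$ where the arctangent convention of equation 4.26 gives $0$ or $\pi/2$; the examples below stay in the first quadrant, where both agree.

```python
import cmath
import math


def hard(x: float, t: float) -> float:
    """Real hard threshold: keep x if |x| > t, otherwise set it to zero."""
    return x if abs(x) > t else 0.0


def hard_threshold_complex(xk: complex, t_re: float, t_im: float) -> complex:
    """Apply the real hard threshold separately to Re[xk] and Im[xk]."""
    return complex(hard(xk.real, t_re), hard(xk.imag, t_im))


def phase_or_none(z: complex):
    """Phase of z, or None when z is zero and the phase is undefined."""
    return None if z == 0 else cmath.phase(z)


cases = {
    "both parts kept": complex(2.0, 2.0),        # phase unchanged
    "imaginary part zeroed": complex(2.0, 0.5),  # phase becomes 0
    "real part zeroed": complex(0.5, 2.0),       # phase becomes pi/2
    "both parts zeroed": complex(0.5, 0.5),      # phase becomes undefined
}

for label, xk in cases.items():
    yk = hard_threshold_complex(xk, 1.0, 1.0)
    print(label, phase_or_none(xk), phase_or_none(yk))
```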

4.2.8 Speech de-noising in the frequency domain using noisy phase. It was noted in the previous section that without saving the noisy phase, we might introduce many phase distortions to the speech signal. In order to improve intelligibility, we save the noisy phase and use the same thresholding process as before (see figure 4.8). In order to restore the noisy phase $\theta$, we need to first apply the thresholding technique as in the previous section, calculate the amplitude of the modified Fourier coefficients, and then combine the amplitude with the noisy phase.


Figure  4.8  Speech  de-noising  in  the  frequency  domain  using  noisy  phase 
4.2.8.1 Soft Thresholding Of Complex Data With Noisy Phase Restoration. Consider the Fourier coefficient $X_k$ where $k = 1, 2, \ldots, (N/2 - 1)$, and denote the real and imaginary soft thresholds by $t_s^{Re}$ and $t_s^{Im}$, respectively. Define the modified complex coefficient by the shrinkage technique (see equation 4.13) as

$$X_k^{t,soft} = \mathrm{Re}[X_k^{t,soft}] + i\,\mathrm{Im}[X_k^{t,soft}], \tag{4.27}$$

and denote the new modified complex coefficient with noisy phase restoration by

$$X_k^{t,soft,\theta} = \left|X_k^{t,soft}\right| e^{i\theta_k}, \tag{4.28}$$

where the phase $\theta_k$ is defined by equation 4.19 and the real and imaginary components of $X_k^{t,soft}$ are as defined by equations 4.14 and 4.15, respectively. In rectangular form, we have

$$X_k^{t,soft,\theta} = \mathrm{Re}[X_k^{t,soft,\theta}] + i\,\mathrm{Im}[X_k^{t,soft,\theta}]. \tag{4.29}$$

Expanding equation 4.28, the new de-noised real part is defined as

$$\mathrm{Re}[X_k^{t,soft,\theta}] = \left|X_k^{t,soft}\right| \cos(\theta_k), \tag{4.30}$$

and the imaginary part is defined as

$$\mathrm{Im}[X_k^{t,soft,\theta}] = \left|X_k^{t,soft}\right| \sin(\theta_k). \tag{4.31}$$

This new shrinkage technique takes advantage of the normal distribution properties of the real and imaginary parts in order to shrink the amplitude. Pictorially, there are four different cases that we need to consider (see figure 4.9). When applied to the complex number $X_k$, this new shrinkage technique has no effect on the phase (since the noisy phase is restored); however, the amplitude is affected in one of two different manners:

1. The amplitude is shrunken toward zero by a nonzero amount.

2. The amplitude is set to zero.

We see then that there are a lot of advantages to keeping the noisy phase so that when we inverse Fourier transform, many of the potential phase distortions due to the thresholding techniques are eliminated.
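Noisy phase restoration amounts to taking the magnitude of the thresholded coefficient and re-attaching the phase of the original one. The sketch below illustrates this for the soft threshold; the thresholds, the test coefficient and the function names are assumptions made for the example.

```python
import cmath
import math


def soft(x: float, t: float) -> float:
    """Real soft threshold: x - min(|x|, t) * sgn(x)."""
    return x - min(abs(x), t) * math.copysign(1.0, x)


def soft_threshold_with_noisy_phase(xk: complex, t_re: float, t_im: float) -> complex:
    """Shrink Re and Im separately, then combine the new magnitude with the noisy phase."""
    thresholded = complex(soft(xk.real, t_re), soft(xk.imag, t_im))
    return cmath.rect(abs(thresholded), cmath.phase(xk))


xk = complex(3.0, 4.0)
restored = soft_threshold_with_noisy_phase(xk, 1.0, 1.0)

print("amplitude:", abs(xk), "->", abs(restored))                 # shrunken
print("phase    :", cmath.phase(xk), "->", cmath.phase(restored))  # preserved
```

When both parts fall below their thresholds the magnitude is zero and the restored coefficient is zero as well, which corresponds to the second of the two cases listed above.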
Figure 4.9 Four possible changes and orientations of a de-noised complex number with noisy phase restoration.


4.2.8.2 Hard Thresholding Of Complex Data With Noisy Phase Restoration. Following the same procedure as before and denoting the modified complex coefficient by the hard thresholding technique (see equation 4.20) as

$$X_k^{t,hard} = \mathrm{Re}[X_k^{t,hard}] + i\,\mathrm{Im}[X_k^{t,hard}], \tag{4.32}$$

and the new modified complex coefficient with noisy phase restoration by

$$X_k^{t,hard,\theta} = \left|X_k^{t,hard}\right| e^{i\theta_k}, \tag{4.33}$$

we obtain the same results as the shrinkage technique. In fact, all the equations of the HTT are the same as those of the STT except for the naming designators (soft and hard). Again, four cases can be considered (see figure 4.9).

4.3 Application Of SURE To DWT

We have seen that the wavelet transform is a linear operator and that the detail coefficients, at a decomposition level $m$, measure the degree of similarity between the signal $f(t)$ and the analyzing wavelet $\psi_{m,n}(t)$; furthermore, the details give us some degree of information concerning the frequency content of the signal $f(t)$. Recall that, due to the down-sampling and filtering operations performed during the decomposition process, the lower levels (i.e., $m = 1, 2, \ldots$) represent high frequency information, and the higher levels (i.e., $m = M, M-1, M-2, \ldots$) represent low frequency information.

Define a noisy vector $\vec{X} = (X_0, X_1, X_2, \ldots, X_{N-1})$ such that

$$\vec{X} = \vec{S} + \vec{Z}, \tag{4.34}$$

where $\vec{S} = (S_0, S_1, S_2, \ldots, S_{N-1})$ is a clean data vector and $\vec{Z} = (Z_0, Z_1, Z_2, \ldots, Z_{N-1})$ is a white Gaussian noise vector such that for $m = 0, 1, 2, \ldots, N-1$

$$Z_m \sim N(0, \sigma^2).$$

Since $\vec{S}$ is a constant clean data vector, the expected value of this vector is the vector $\vec{\mu}$ such that

$$\vec{S} = \vec{\mu},$$

where $\vec{\mu} = (\mu_0, \mu_1, \mu_2, \ldots, \mu_{N-1})$. The noisy vector $\vec{X}$, which is formed by the sum of a constant vector $\vec{S}$ and a normal vector $\vec{Z}$, has a normal distribution with mean $\vec{\mu}$ such that

$$X_m \sim N(\mu_m, \sigma^2), \tag{4.35}$$

where $m = 0, 1, 2, \ldots, N-1$.

4.3.1 Voiced speech vs. White Gaussian Noise. The wavelet decomposition of the normal random vector $\vec{X}$ at the $m^{th}$ level ($1 \le m \le M$) is given by equations 3.106 and 3.107 as


fc€Z 

(4.36) 

■®m,n  ~  ^m— 

kez 

(4.37) 

where $C^X_{m,n}$ and $D^X_{m,n}$ denote the approximation and detail random variables with respect to $\vec{X}$, respectively. Since the DWT is linear and orthogonal, we have


(4.38) 

*'m,n  »  ^m,n’ 

(4.39) 

Using the above results and the fact that the DWT coefficients are also independent and normally distributed, it can easily be shown that the wavelet coefficients (details and approximations) of the white Gaussian noise, $\vec{Z}$, at the $m^{th}$ decomposition level, are themselves white Gaussian noise with zero mean and the same variance, $\sigma^2$. This normal distribution property of the wavelet coefficients makes them candidates for use with the SURE function developed earlier. Since the detail coefficients measure the amount of some frequencies in a well defined band of frequencies (depending on the decomposition level $m$ and the analyzing wavelet $\psi_{m,n}$), we can directly apply the de-noising process to certain bands of frequencies where the white Gaussian noise has a high probability of residing. Since the formant frequencies of voiced speech are relatively low frequencies (below 3 kHz), and white Gaussian noise uniformly contains all frequencies, the early stages of decomposition have a high probability of filtering most of the high frequencies that are due to noise, while the later stages of decomposition filter the voiced speech signal (see figures E.1 through E.10 for voiced speech de-noising using shrinkage).

Since both the STT and the HTT techniques are non-linear thresholding techniques, we decided to apply the DWT (discrete wavelet transform) to our signal up to a decomposition level where the pitch frequency is not affected by the non-linear thresholding (Note: our algorithm gives you an option to process both the approximations and the details). Recall that the DWT is a filtering operation that uses a low-pass filter ($h$) and a high-pass filter ($g$). At each level of decomposition, the high-pass $g$ filter divides the frequency spectrum by half. Given a noisy voiced speech signal where the pitch frequency $f_p$ is known and a sampling frequency $f_s = 16$ kHz, the maximum resolvable frequency is $f_s/2 = 8$ kHz (14) (17). In order not to affect the pitch frequency, we need to decompose up to a level $m \le m_v$ where


$$m_v = \left\lfloor \log_2\left(\frac{f_s}{2 f_p}\right) \right\rfloor, \tag{4.40}$$


where $\lfloor \cdot \rfloor$ is the floor function. Since our speech data is sampled at 16 kHz and we are assuming a typical pitch frequency of 125 Hz, the $m_v$ value is 6. By decomposing the signal up to the $m_v$-level and applying our thresholding techniques, we have a high chance of eliminating most of the noise in the first $m_v$ levels without affecting the pitch of the voiced speech, which resides in the remaining coarser levels. This partial wavelet decomposition of the voiced speech signal yields voiced speech where the structure of the pitch is not subjected to the thresholding techniques, i.e., the pitch is contained mostly at the approximation levels (see figure 4.10).
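As a quick arithmetic check of the level quoted above, the sketch below assumes that the maximum level is the number of times the band $[0, f_s/2]$ can be halved before its upper edge drops below the pitch frequency, that is $\lfloor \log_2(f_s / (2 f_p)) \rfloor$. This reading is an assumption chosen to be consistent with the stated values ($f_s = 16$ kHz, $f_p = 125$ Hz, level 6); the function name is hypothetical.

```python
import math


def max_voiced_level(fs_hz: float, pitch_hz: float) -> int:
    """Assumed reading: floor(log2((fs / 2) / fp)), the number of halvings of [0, fs/2]."""
    return math.floor(math.log2(fs_hz / (2.0 * pitch_hz)))


print(max_voiced_level(16000.0, 125.0))  # 8000 / 125 = 64 = 2**6
print(max_voiced_level(16000.0, 250.0))  # 8000 / 250 = 32 = 2**5
```

After six halvings the approximation band spans 0 to 125 Hz, so a higher assumed pitch leads to fewer thresholded levels.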


4.3.2 Wavelet Coefficients Thresholding. Having determined the maximum level of decomposition, $m_v$, we can apply either the soft thresholding technique or the hard thresholding technique to each of the $m_v$ levels of details. Consider the noisy signal with $N = 2^M$ points. We know that by applying the DWT discussed earlier, the total number of decomposition levels is $M$. Define the detail coefficients of the $m^{th}$ decomposition level by $D_{m,n}$ where $1 \le m \le m_v$ and $0 \le n \le 2^{M-m} - 1$.

Figure 4.10 Filtering noise and voiced speech by DWT of voiced speech up to the $m_v$-level.

4.3.2.1 Wavelet Shrinkage Of The Detail Coefficients. Consider the $m^{th}$ decomposition level and denote the soft threshold at this level by $t_s^m$. Define the shrunken version of the detail coefficient $D_{m,n}$ by

$$D^t_{m,n} = D_{m,n} - \min\left(|D_{m,n}|, t_s^m\right)\mathrm{sgn}\left(D_{m,n}\right), \tag{4.41}$$

where $0 \le n \le 2^{M-m} - 1$. We can then define a function $g_s^m$ which is level dependent, such that

$$g_s^m(D_m) = -\min\left(|D_{m,n}|, t_s^m\right)\mathrm{sgn}\left(D_{m,n}\right), \tag{4.42}$$

where $D_m$ is the vector whose elements are the $2^{M-m}$ detail coefficients at the $m^{th}$ decomposition level. Since we have $m_v$ levels, we have $m_v$ thresholds $t_s^m$ and functions $g_s^m(D_m)$.

4.3.2.2 Hard Thresholding Of The Wavelet Detail Coefficients. In a similar fashion, we can apply the hard thresholding technique to the wavelet details and the results are similar to the shrinkage case. Consider the $m^{th}$ decomposition level and denote the hard threshold at this level by $t_h^m$. Define the hard thresholded version of the detail coefficient $D_{m,n}$ by

$$D^t_{m,n} = D_{m,n}\,\chi_{[-t_h^m,\,t_h^m]^c}\left(D_{m,n}\right), \tag{4.43}$$

where $0 \le n \le 2^{M-m} - 1$ and $\chi_{[-t,\,t]^c}$ is the indicator function of the complement of the interval $[-t, t]$. We can also define a function $g_h^m$ that is level dependent, such that

$$g_h^m(D_m) = -D_{m,n}\,\chi_{[-t_h^m,\,t_h^m]}\left(D_{m,n}\right), \tag{4.44}$$

where $D_m$ is the vector whose elements are the $2^{M-m}$ detail coefficients at the $m^{th}$ decomposition level. Since we have $m_v$ levels, we have $m_v$ thresholds $t_h^m$ and functions $g_h^m(D_m)$.
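The level-dependent thresholding of the details can be sketched as follows. The coefficient values and thresholds are made up for illustration (in the thesis the thresholds come from the SURE criterion), the wavelet transform itself is not included, and the function names are hypothetical. `details[m - 1]` is assumed to hold the detail coefficients of level $m$.

```python
import math


def soft(d: float, t: float) -> float:
    """Soft threshold: d - min(|d|, t) * sgn(d)."""
    return d - min(abs(d), t) * math.copysign(1.0, d)


def hard(d: float, t: float) -> float:
    """Hard threshold: keep d if |d| > t, otherwise set it to zero."""
    return d if abs(d) > t else 0.0


def threshold_details(details, thresholds, mode="soft"):
    """Apply one threshold per decomposition level to that level's detail coefficients."""
    rule = soft if mode == "soft" else hard
    return [[rule(d, t) for d in level] for level, t in zip(details, thresholds)]


# Two levels of (made-up) detail coefficients with one threshold per level.
details = [[2.0, -0.5, 1.5, -3.0], [0.25, -2.0]]
thresholds = [1.0, 0.5]

soft_details = threshold_details(details, thresholds, mode="soft")
hard_details = threshold_details(details, thresholds, mode="hard")
print(soft_details)
print(hard_details)
```

The soft rule moves every surviving coefficient towards zero by the level threshold, while the hard rule leaves surviving coefficients unchanged.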

4.3.2.3 De-noising The Approximations. The pitch of the voiced speech is represented by the approximation coefficients at the $m_v^{th}$ decomposition level. The total number of these coefficients is $2^{M-m_v}$. In order to prevent this voiced signal from being distorted, we choose to either leave the approximation coefficients $\{C_{m_v,n}\}_{n=0}^{2^{M-m_v}-1}$ untouched or adjust their energy by the same amount as the energy change of all the thresholded details (STT or HTT). In other words, the ratio between the energies of the de-noised details and the noisy details is defined by

$$R_D = \frac{E_{D^t}}{E_D}, \tag{4.45}$$

where

$$E_D = \sum_{m=1}^{m_v} \sum_{n=0}^{2^{M-m}-1} \left[D_{m,n}\right]^2 \tag{4.46}$$

and

$$E_{D^t} = \sum_{m=1}^{m_v} \sum_{n=0}^{2^{M-m}-1} \left[D^t_{m,n}\right]^2. \tag{4.47}$$

Since the noisy details are thresholded, we have $E_{D^t} \le E_D$. The detail ratio is then constrained as

$$0 \le R_D \le 1. \tag{4.48}$$

The new approximation coefficients at the $m_v$-level are then defined as

^m,,n  ~  Cfn,,ni  (4.49) 

where $0 \le n \le 2^{M-m_v} - 1$. The ratio $R_D$ helps balance the energy between the approximations and the details as well as reduce the power of any noise that passed through the $m_v$ decomposition level (see figure 4.11).
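The energy balancing step can be illustrated with a small sketch. It computes the ratio of de-noised to noisy detail energy and rescales the approximation coefficients so that their energy changes by that same ratio. Scaling each coefficient by the square root of the ratio is an assumption made here so that the energy, which is quadratic in the coefficients, scales by the ratio itself; the function names and numbers are hypothetical.

```python
import math


def energy(levels):
    """Sum of squared coefficients over all levels."""
    return sum(d * d for level in levels for d in level)


def detail_energy_ratio(noisy_details, denoised_details):
    """Ratio of de-noised to noisy detail energy; lies in [0, 1] after thresholding."""
    e_noisy = energy(noisy_details)
    return energy(denoised_details) / e_noisy if e_noisy > 0 else 1.0


def rescale_approximations(approx, ratio):
    """Scale coefficients by sqrt(ratio) so that their energy scales by ratio (assumption)."""
    scale = math.sqrt(ratio)
    return [scale * c for c in approx]


noisy_details = [[3.0, 4.0]]      # energy 25
denoised_details = [[0.0, 4.0]]   # energy 16 after hard thresholding at t = 3.5
approx = [5.0, 10.0]

ratio = detail_energy_ratio(noisy_details, denoised_details)
new_approx = rescale_approximations(approx, ratio)
print(ratio, new_approx)
```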

4.3.3 De-noising The DWT of The Time Domain. We have seen that the wavelet transform of a normal multidimensional random vector produces a set of detail coefficient vectors that are also normal. By applying the SURE thresholding techniques to these details (see figure 4.12), we can eliminate most of the noise at the first $m_v$ levels. Since the wavelets are band-pass filters, at each level of decomposition an entire band of high frequencies is being de-noised. We then expect the output of this method to eliminate most of the high frequencies that are mainly due to noise.

Figure 4.11 Wavelet reconstruction of the thresholded (STT or HTT) voiced speech starting from the $m_v$-level, where $1 \le m_v \le M$, to the zeroth level, where the number of samples is $N = 2^M$.

A variation of this purely time-wavelet domain scheme may be employed to minimize the phase distortion introduced by the nonlinear effect of the thresholding techniques (STT and HTT). In order to reduce the effect of phase distortions, we may save the noisy phase from the Fourier transform of the noisy voiced speech and restore it after the de-noising procedures. Figure 4.13 illustrates the method; the time domain voiced speech waveform is first Fourier transformed to extract the phase and then wavelet transformed before the de-noising process is applied. The thresholded details are then inverse wavelet transformed and Fourier transformed in order to extract the de-noised amplitude. Finally, the old phase is combined with this newly calculated amplitude and inverse Fourier transformed back to the time domain. Observe that this method requires three Fourier transforms and two wavelet transforms.

Figure 4.12 Speech de-noising in the time domain using wavelets


Figure  4.13  Speech  de-noising  in  the  time  domain  using  noisy  phase  and  wavelets 

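De-noising The DWT of The Fourier Domain. We have seen that the Fourier transform of a normal multidimensional random vector produces a set of real and imaginary coefficients that are also normal. Since the wavelet transform is a linear and orthogonal operation, the wavelet transform of the Fourier transform of a normal multidimensional random vector produces a normal complex vector. Let $f(t) \in L^2(\mathbb{R})$ and define its Fourier transform by

$$(\mathcal{F}f)(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(t)\, e^{-i\omega t}\, dt. \tag{4.50}$$

The continuous wavelet transform with scale $a$ and shift $b$ of the Fourier transform of $f$ is defined as

$$W^{a,b}[(\mathcal{F}f)(\omega)] = \int_{-\infty}^{+\infty} (\mathcal{F}f)(\omega)\, \psi_{a,b}(\omega)\, d\omega, \tag{4.51}$$

where $(a, b) \in \mathbb{R}^{+} \times \mathbb{R}$ and

$$\psi_{a,b}(\omega) = \frac{1}{\sqrt{a}}\, \psi\left(\frac{\omega - b}{a}\right). \tag{4.52}$$

Substituting equation 4.50 into equation 4.51, we get

$$\begin{aligned}
W^{a,b}[(\mathcal{F}f)(\omega)] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(t)\, e^{-i\omega t}\, \psi_{a,b}(\omega)\, dt\, d\omega \\
&= \int_{-\infty}^{+\infty} f(t) \left[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-i\omega t}\, \psi_{a,b}(\omega)\, d\omega \right] dt \\
&= \int_{-\infty}^{+\infty} f(t)\, \Psi_{a,b}(t)\, dt,
\end{aligned} \tag{4.53}$$

where for real wavelets,

$$\Psi_{a,b}(t) = \sqrt{a}\, e^{-ibt}\, \hat{\psi}(at), \tag{4.54}$$

and $\hat{\psi}$ is the Fourier transform of the mother wavelet $\psi$. Equation 4.53 represents the inner product of $f(t)$ with respect to the wavelet based function $\Psi_{a,b}(t)$. In other words, $W^{a,b}[(\mathcal{F}f)(\omega)]$ represents the similarity between $f(t)$ and the function $\Psi_{a,b}(t)$, which acts like a window on the signal, $f(t)$.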

By  applying  the  SURE  thresholding  techniques  to  the  real  and  imaginary  wavelet-Fourier 
details  (see  figure  4.14),  we  can  eliminate  most  of  the  noise  at  each  decomposition  level. 

Figure 4.14 Speech de-noising in the frequency domain using wavelets

A variation of this purely wavelet-Fourier domain scheme may be employed to minimize the phase distortion introduced by the nonlinear effect of the thresholding techniques (STT and HTT) on the real and imaginary parts of the wavelet-Fourier details. In order to reduce the effect of phase distortions, we may save the noisy phase from the Fourier transform of the noisy voiced speech and restore it after the de-noising procedures. Figure 4.15 illustrates the method; the time domain voiced speech waveform is first Fourier transformed to extract the phase and then the wavelet transform of the Fourier transform is taken before the independent de-noising process of the real and imaginary parts is applied. The thresholded details (real and imaginary) are then inverse wavelet transformed independently in order to produce the de-noised real and imaginary parts, Fourier transformed in order to extract the de-noised amplitude, and finally, the old phase is combined with this newly calculated amplitude and inverse Fourier transformed back to the time domain. Observe that this method requires two Fourier transforms and four wavelet transforms (see figure 4.15).

Figure 4.15 Speech de-noising in the frequency domain using noisy phase and wavelets

V. Experiments And Results


5.1  Experiments 

In this chapter, we present the results of applying the thresholding techniques we developed in the last chapter. Eight different speech processing systems were studied. We start by explaining the assumptions made and the parameters used for each experiment. We then discuss the quantitative and qualitative results for all eight experiments, as well as the spectrum analysis for some experiments. The quantitative results are based on the total squared error between each experiment's output and both the clean and noisy signals. On the other hand, the qualitative results are all based on the results of the listening tests that we conducted with an untrained jury of six students (four males and two females). Before each informal listening test, the listener is given a chance to listen to both the clean and noisy speech signals (SNRs of 0 dB and 6 dB) and then he or she is briefed about what the test is all about (see figure 5.1). The listeners were asked to make a choice between two de-noised speech signals (e.g., choice between time processing vs. Fourier processing of the same noisy signal). Finally, we present and analyze some spectrograms of four different de-noising methods. We conclude this chapter with a summary of the tests' results and some of the recommendations we encountered throughout this thesis work.
5.1.1 Experimental Set Up. Due to the large number of methods and the flexibility of the parameters available for experiments, we fixed the following inputs to the speech de-noising algorithms we presented in chapter four:

1. The percent factor p applied to the unvoiced and silent portions is p = 50%.

2. The maximum voiced decomposition level is $m_v$ = 6.

3. The overlap between adjacent speech windows is overlap = 16.

4. The number of samples of the original speech (“They enjoy it when I audition”) is N = 31200.

5. The sampling frequency is 16 kHz.

6. The approximation coefficients (pitch of voiced speech) are not processed (i.e., untouched and still noisy).

Experimentally, we fixed the overlap between adjacent windows (overlap = 16), and we determined that by keeping only p = 50% of the ratio (see equation 4.4), the transition obtained between the voiced portions and both the silent and the unvoiced portions improved intelligibility considerably.

5.1.2 Experimental Speech Signals. Starting with a clean speech signal (“They enjoy it when I audition”) of 31200 samples, we generated seven different white Gaussian noise signals and seven noisy signals such that the signal-to-noise ratios (SNRs) are as follows: -10 dB, -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, and 10 dB. Using these noisy signals, we produced both soft thresholded and hard thresholded signals with the following methods:

1. De-noising in the time domain.

2. De-noising in the time domain using the noisy phase.

3. De-noising in the frequency domain.

4. De-noising in the frequency domain using the noisy phase.

5. De-noising in the time domain using wavelets.

6. De-noising in the time domain using the noisy phase and wavelets.

7. De-noising in the frequency domain using wavelets.

8. De-noising in the frequency domain using the noisy phase and wavelets.

Figure 5.1 Clean speech and noisy speech (6 dB and 0 dB SNRs).

The total number of de-noised signals without wavelets (i.e., using Stein's criteria) is 56, since there are two thresholding techniques (STT and HTT), seven different noisy signals, and four different methods that don't involve wavelets. The total number of de-noised signals with wavelets (i.e., using Donoho's criteria) is 168, using two thresholding techniques (STT and HTT), seven different noisy signals, four different methods that involve wavelets, and three different wavelets (db20, db6, and coiflets(6)). Hence, the total number of files studied is 224.
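The file counts quoted above follow directly from multiplying the number of options at each stage. The short check below only restates that arithmetic; the variable names are illustrative.

```python
thresholding_techniques = 2   # STT and HTT
noisy_signals = 7             # SNRs from -10 dB to 10 dB
methods_without_wavelets = 4
methods_with_wavelets = 4
wavelets = 3                  # db20, db6, coiflets(6)

files_without_wavelets = thresholding_techniques * noisy_signals * methods_without_wavelets
files_with_wavelets = thresholding_techniques * noisy_signals * methods_with_wavelets * wavelets
total_files = files_without_wavelets + files_with_wavelets

print(files_without_wavelets, files_with_wavelets, total_files)
```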

5.1.3 Quantitative analysis. Since the soft thresholding technique (STT) pulls towards zero every single voiced sample and the hard thresholding technique (HTT) pulls towards zero only the voiced elements below the hard threshold ($t^{hard}$) in absolute value, we expect, theoretically, the energy of the de-noised signal under the STT to be less than the energy of the de-noised signal under the HTT. In order to quantify this result, the total squared error between the de-noised signal and the noisy signal, using the STT technique, is defined as

$$\mathrm{Error}^{soft}_{noisy} = \sum_{n=0}^{N-1} \left(x^{soft}_n - x^{noisy}_n\right)^2, \tag{5.1}$$

where $x^{soft}_n$ and $x^{noisy}_n$ are the $n^{th}$ samples of the STT de-noised signal and of the noisy signal, and $N$ is the total number of samples of the speech signal under analysis (i.e., $N = 31200$). Similarly, the HTT total error is defined as

$$\mathrm{Error}^{hard}_{noisy} = \sum_{n=0}^{N-1} \left(x^{hard}_n - x^{noisy}_n\right)^2. \tag{5.2}$$

Both $\mathrm{Error}^{soft}_{noisy}$ and $\mathrm{Error}^{hard}_{noisy}$ measure the closeness of the de-noised signal to the noisy signal. Ideally, we want to be as far away as possible from the noisy signal, and still preserve the intelligibility of the de-noised speech signal. The experiments illustrate that (see figures F.1 and F.3)

$$\mathrm{Error}^{soft}_{noisy} > \mathrm{Error}^{hard}_{noisy}. \tag{5.3}$$

In fact, because of the definitions of the STT and the HTT, the use of the STT removes noise from all samples, while the use of the HTT removes noise only from certain samples. For this reason, the de-noised speech signal under the HTT has more remaining noise than the de-noised speech signal under the STT. In all experiments and for all seven noisy speech signals analyzed, figures F.1 and F.3 illustrate the fact that the STT outperforms the HTT with respect to the total squared error between the de-noised signals and the noisy signals.
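The energy and error comparison between the two techniques can be illustrated on a toy signal. The sketch below is not the experimental procedure: it thresholds a handful of made-up samples directly, uses the same hand-picked threshold for both techniques, and uses hypothetical function names. Under that same-threshold assumption the soft output never has more energy than the hard output, and it is never closer to the noisy input.

```python
import math


def soft(x: float, t: float) -> float:
    """Soft threshold: x - min(|x|, t) * sgn(x)."""
    return x - min(abs(x), t) * math.copysign(1.0, x)


def hard(x: float, t: float) -> float:
    """Hard threshold: keep x if |x| > t, otherwise set it to zero."""
    return x if abs(x) > t else 0.0


def total_squared_error(a, b):
    """Sum over all samples of the squared difference between two signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def signal_energy(a):
    return sum(x * x for x in a)


noisy = [2.0, -0.5, 1.5, -3.0, 0.25]
t = 1.0
soft_out = [soft(x, t) for x in noisy]
hard_out = [hard(x, t) for x in noisy]

error_soft_noisy = total_squared_error(soft_out, noisy)
error_hard_noisy = total_squared_error(hard_out, noisy)
print("energy: soft", signal_energy(soft_out), "hard", signal_energy(hard_out))
print("error vs noisy: soft", error_soft_noisy, "hard", error_hard_noisy)
```

The same `total_squared_error` helper can be pointed at a clean reference signal instead of the noisy one to obtain the error with respect to the clean signal.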

Since the purpose of our de-noising technique is to attenuate the effect of the noise, we would like the de-noised speech signals to be as close to the clean signal as possible. In order to quantify this result, the total squared error between the de-noised signal and the clean signal, using the STT technique, is defined as

$$\mathrm{Error}^{soft}_{clean} = \sum_{n=0}^{N-1} \left(x^{soft}_n - x^{clean}_n\right)^2, \tag{5.4}$$

where $x^{soft}_n$ and $x^{clean}_n$ are the $n^{th}$ samples of the STT de-noised signal and of the clean signal. Similarly, the HTT total error is defined as

$$\mathrm{Error}^{hard}_{clean} = \sum_{n=0}^{N-1} \left(x^{hard}_n - x^{clean}_n\right)^2. \tag{5.5}$$

Both $\mathrm{Error}^{soft}_{clean}$ and $\mathrm{Error}^{hard}_{clean}$ measure the closeness of the de-noised signal to the clean signal. Ideally, we want the de-noised speech signal to be as close as possible to the clean signal, and still preserve the intelligibility of the de-noised speech signal. Both theory and experiments show that (see figures F.2 and F.4)

$$\mathrm{Error}^{soft}_{clean} < \mathrm{Error}^{hard}_{clean}. \tag{5.6}$$

Again, in all experiments and for all seven noisy speech signals analyzed, figures F.2 and F.4 illustrate the fact that the STT outperforms the HTT with respect to the total squared error between the de-noised signals and the clean signal.

5.1.4 Qualitative Analysis Of The Informal Listening Tests. The qualitative analysis of the de-noised speech signals depends on many factors. In order to understand the advantages and disadvantages of each of the eight methods described earlier, we chose to study two noisy signals with signal-to-noise ratios of 0 dB and 6 dB, respectively. The 0 dB signal represents a noisy speech signal with a relatively high level of noise, while the 6 dB signal represents a noisy speech signal with a relatively low level of noise.

5. 1.4.1  Effects  Of  STT  vs.  HTT.  In  order  to  study  the  effects  of  the  STT  vs.  the 
HTT,  we  randomly  selected  a  jury  of  six  students  (four  males  and  four  females),  considered  to 
be  untrained  listeners.  We  presented  to  these  listeners  two  groups  of  speech  signals;  group  A  has 
speech  signals  processed  using  the  STT  method  and  group  B  has  speech  signals  processed  using 
the  HTT  method.  Each  group  has  two  sets  of  de-noised  speech  signals,  where  the  original  noisy 
speech  signals  have  SNRs  of  Odb  and  6db.  Each  set  has  speech  data  processed  using  the  foUowing 
speech  de-noising  systems  (SDS): 

1.  De-noising  in  the  time  domain. 

2.  De-noising  in  the  time  domain  using  the  noby  phase. 

3.  De-noising  in  the  frequency  domain. 

4.  De-noising  in  the  frequency  domain  using  the  noisy  phase. 

5.  De-noising  in  the  time  domain  using  wavelets. 
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6.  De-noising  in  the  time  domain  using  the  noisy  phase  and  wavelets. 

7.  De-noising  in  the  frequency  domain  using  wavelets. 

8.  De-noising  in  the  frequency  domain  using  the  noisy  phase  and  wavelets. 

We asked the students to listen to each speech signal from group A and compare it with its
counterpart in group B (e.g., 0db of group A using SDS-1 vs. 0db of group B using SDS-1). All
the students concluded that the STT method leaves less remaining noise and, hence, that it is easier
to listen to the STT-processed speech signals than the HTT-processed speech signals. For this reason,
we chose to continue experimenting with only the speech signals produced by the STT method.

5.1.4.2 Effects Of Preserving The Noisy Phase Using STT. Based on the results
of the STT vs. the HTT experiment above, and in order to study the effects of the phase, we
presented to the same jury of students two groups of speech signals processed using the STT
method: group A has de-noised speech data processed without restoration of the noisy phase and
group B has de-noised speech data with restoration of the noisy phase. Each group has two sets of
de-noised speech signals, where the original noisy speech signals have SNRs of 0db and 6db. Each
set has speech data processed using the following speech de-noising systems (SDS):

A.  No  preservation  of  the  noisy  phase: 

1.  De-noising  in  the  time  domain. 

2. De-noising in the frequency domain.

3. De-noising in the time domain using wavelets.

4.  De-noising  in  the  frequency  domain  using  wavelets. 

B.  Preservation  of  the  noisy  phase: 

1. De-noising in the time domain using the noisy phase.

2.  De-noising  in  the  frequency  domain  using  the  noisy  phase. 

3.  De-noising  in  the  time  domain  using  the  noisy  phase  and  wavelets. 

4. De-noising in the frequency domain using the noisy phase and wavelets.

We asked the students to listen to each speech signal from group A and compare it with its
counterpart in group B (e.g., 0db of group A using SDS-1 vs. 0db of group B using SDS-1). All the
students concluded that the intelligibility of group B is much better than the intelligibility of A and
that it is easier to listen to the speech signals processed with noisy phase restoration than to listen to
the speech signals processed without noisy phase restoration. For this reason, we chose to continue
experimenting with only the speech signals produced using both the STT method and the phase
restoration technique.

5.1.4.3 Effects Of The Time vs. Fourier Domains On Speech De-noising Using STT
And Noisy Phase Restoration. Based on the results of the last two sections, we chose to continue
experimenting with speech data processed using both the STT and the phase restoration techniques.
In order to study the effect of the time domain vs. the Fourier domain, we presented to the jury
of students two groups of de-noised speech signals: group A has speech data processed in the time
domain and group B has speech data processed in the Fourier domain. Both groups have speech
data processed using both the STT and phase restoration techniques. Each group has two sets of
de-noised speech signals, where the original noisy speech signals have SNRs of 0db and 6db. Each
set has speech data processed using the following speech de-noising systems (SDS):

A.  Time  domain: 

1.  De-noising  in  the  time  domain  using  the  noisy  phase. 

2.  De-noising  in  the  time  domain  using  the  noisy  phase  and  wavelets. 

B. Fourier domain:

1.  De-noising  in  the  frequency  domain  using  the  noisy  phase. 

2.  De-noising  in  the  frequency  domain  using  the  noisy  phase  and  wavelets. 

We asked the students to listen to each speech signal from group A and compare it with its
counterpart in group B (e.g., 0db of group A using SDS-1 vs. 0db of group B using SDS-1). All the
students concluded that the intelligibility of group B is much better than the intelligibility of A
and that it is easier to listen to the speech signals processed in the Fourier domain than to listen to the
speech signals processed in the time domain. For this reason, we chose to continue experimenting
with only the speech signals produced in the Fourier domain using both the STT method and the
phase restoration technique.

5.1.4.4 Effects Of Wavelets On Speech De-noising In The Fourier Domain. Based
on the results of the last three sections, we chose to continue experimenting with speech data
processed in the Fourier domain using both the STT and the phase restoration techniques. In order
to study the effect of using wavelets vs. not using wavelets in the Fourier domain, we presented to
the jury of students two groups of de-noised speech signals: group A has speech data processed in
the Fourier domain using wavelets and group B has speech data processed in the Fourier
domain without using wavelets. Both groups have speech data processed in the Fourier domain using the
STT and phase restoration techniques. Each group has two sets of de-noised speech signals, with
SNRs of 0db and 6db. Each set has speech data processed using the following speech de-noising
systems (SDS):

A. Wavelets:

1.  De-noising  in  the  frequency  domain  using  the  noisy  phase  and  wavelets. 

B.  No  wavelets: 

1. De-noising in the frequency domain using the noisy phase.

We asked the students to listen to each speech signal from group A and compare it with
its counterpart in group B (e.g., 0db of group A using SDS-1 vs. 0db of group B using SDS-1).
All the students concluded that for 6db, the de-noised speech signals from both groups are very
close in terms of intelligibility; however, for 0db, the intelligibility of group A is much better than
the intelligibility of B. Since our jury was forced to choose only one group, all students chose group
A because they concluded that it is easier to listen to the speech signals processed in the Fourier
domain using wavelets than to listen to the speech signals processed in the Fourier domain without
using wavelets.

5.1.5 Spectrum Analysis Of De-noised Speech Data Using The STT. We mentioned earlier
that the production of speech through the vocal tract is characterized as either voiced or unvoiced.
The unvoiced speech signals, the fricatives, behave like noise and have high energy above about
3kHz and relatively very low energy below 3kHz (19). On the other hand, most voiced speech is
located at bands of frequencies below 3kHz. The pitch and the first formant are, in general, located
below 500Hz, while the second and third formants are located between 500Hz and 3kHz. The
formant frequencies are important because most of the voiced speech characteristics (i.e., pitch) are
based on the location of these frequencies. In order to study the effects on the frequency content
of our speech signals, we chose three different wavelets and four different de-noising techniques.
We generated two sets of spectrograms, wide-band and narrow-band (for clean and noisy speech
only). Wide-band spectrograms have a small analysis window; therefore, the frequency resolution
is low, while the time resolution is high. On the other hand, narrow-band spectrograms have a
large analysis window; therefore, the frequency resolution is high, while the time resolution is low.
Each set (narrow-band and wide-band) of spectrograms includes:

1.  Clean  speech. 

2. Noisy speech 0db (relatively high noise level).

3.  Noisy  speech  6db  (relatively  low  noise  level). 

4. For each of the three wavelets used (db20, db6, and coiflets(6)) and for each of the noise
levels used (0db and 6db), we studied the frequency content of the de-noised speech data using
shrinkage with the following speech de-noising systems (SDS):

a.  De-noising  in  the  time  domain. 

b. De-noising in the time domain using wavelets.

c. De-noising in the frequency domain using the noisy phase.

d.  De-noising  in  the  frequency  domain  using  the  noisy  phase  and  wavelets. 
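The trade-off between wide-band and narrow-band spectrograms described above can be made concrete with a short calculation. This is a rough sketch: it assumes a sampling rate of 8 kHz and window lengths of 32 and 256 samples, none of which are stated in the text, and it approximates the frequency resolution as the sampling rate divided by the window length, ignoring the window shape.

```python
def spectrogram_resolution(window_samples, sample_rate_hz):
    """Return (time_resolution_s, frequency_resolution_hz) for one analysis window.

    The frequency resolution is approximated as sample_rate / window_length,
    which ignores the extra smearing introduced by the window shape.
    """
    time_resolution_s = window_samples / sample_rate_hz
    frequency_resolution_hz = sample_rate_hz / window_samples
    return time_resolution_s, frequency_resolution_hz


SAMPLE_RATE_HZ = 8000  # assumed value, not taken from the thesis

# Short window: fine in time, coarse in frequency (wide-band).
wide_band = spectrogram_resolution(32, SAMPLE_RATE_HZ)

# Long window: coarse in time, fine in frequency (narrow-band).
narrow_band = spectrogram_resolution(256, SAMPLE_RATE_HZ)
```

With these assumed values, the short window spans 4 ms and resolves about 250 Hz, while the long window spans 32 ms and resolves about 31 Hz.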

The spectrograms of the clean signal show very clearly the pitch and the first, second, and third
formants. These frequencies have high energy below 3kHz (see figure G.1). Despite the addition of
the white Gaussian noise, the spectrograms of the noisy speech signals with signal-to-noise ratios
of 0db and 6db show that the pitch and the first, second, and third formants are still dominant below
3kHz. However, the effect of white Gaussian noise can be clearly seen throughout
the spectrograms. In fact, since the white Gaussian noise is, in general, a broad-band signal, the
spectrogram indicates high energy at all frequency bands (see figures G.2 and G.3).

5.1.5.1 Effects Of Stein’s Criteria On Time De-noising vs. Fourier De-noising Using
Noisy Phase Restoration. De-noising in the time domain using both Stein’s criteria and the
noisy phase works relatively well for high signal-to-noise ratios. In fact, when the noise level is
low (e.g., 6db), most of the signal’s formant structure below 3kHz is still preserved;
however, a lot of high frequency noise is still present (see figure H.1). When the noise level increases,
the noisy speech signal looks like white Gaussian noise and the application of Stein’s criteria tends
to eliminate most of the speech signal itself, thereby affecting most of the formant frequencies.
On the other hand, de-noising in the Fourier domain using Stein’s criteria and preserving the noisy
phase works much better because the noise is split between the real and imaginary
parts of the Fourier transform. Since the noisy phase is restored, most of the noisy speech structure
(pitch and formants) is carried back into the de-noised speech signal. Despite these improvements,
when the noise level is relatively high, the real and imaginary parts become very noisy and Stein’s
criteria affects the true structure of the signal (see figure H.2).
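One way to read the statement that the noise is split between the real and imaginary parts is through the variance of a single DFT bin. For zero-mean white noise of variance sigma^2 and N samples, the real part of bin k has variance sigma^2 times the sum of cos^2(2 pi k n / N), and the imaginary part has variance sigma^2 times the sum of sin^2(2 pi k n / N). For bins other than 0 and N/2, each sum equals N/2, so the total noise power N sigma^2 is shared equally. The sketch below checks this numerically; the function name and the chosen bin are illustrative assumptions.

```python
import math


def noise_variance_split(num_samples, bin_index, noise_variance=1.0):
    """Variances of the real and imaginary parts of one DFT bin when the
    input is zero-mean white noise with the given variance."""
    angles = [2.0 * math.pi * bin_index * n / num_samples
              for n in range(num_samples)]
    var_real = noise_variance * sum(math.cos(a) ** 2 for a in angles)
    var_imag = noise_variance * sum(math.sin(a) ** 2 for a in angles)
    return var_real, var_imag


# 512 samples and unit variance; bin 37 is an arbitrary interior bin.
var_real, var_imag = noise_variance_split(num_samples=512, bin_index=37)
```

Both values come out at half of the bin's total noise power, which is consistent with each of the two sequences being less noisy than the complex spectrum taken as a whole.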

5.1.5.2 Effects Of The Wavelet Choice On De-noising In The Time Domain. By using
wavelets, we decompose a noisy signal into bands of frequencies and then we de-noise each band
separately. This process is potentially more powerful than the methods that don’t use wavelets.
However, the choice of the right wavelet with good filtering characteristics is very important. We
chose three different wavelets: db6, coiflets(6), and db20. Since the wavelet transform is a filtering
operation, the filtering characteristics of the wavelet become crucial. The spectrograms
for both 0db and 6db using wavelets in the time domain show that there is an aliasing
effect for both db6 and coiflets(6) (see figures I.1 and I.2). This aliasing is due to
the fact that the Fourier transforms of both db6 and coiflets(6) have many high energy side lobes,
which give these wavelets poor filtering qualities. On the other hand,
the spectrograms of db20 show no aliasing at all, which makes db20 a very good wavelet to use
in speech processing (see figure I.3). However, because the cubic splines are not
compactly supported wavelets, their use in practice requires an approximation which affects the
general behavior of the spline wavelets. The best results in terms of total squared error, intelligibility,
and the preservation of formant frequencies were given by db20, which is a compactly supported
wavelet with a very good filtering quality (i.e., very small side lobes).
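The decomposition into bands of frequencies mentioned above follows a dyadic pattern: each detail level covers half the bandwidth of the previous one. The sketch below lists the nominal band edges per level. It assumes ideal half-band filters and a sampling rate of 8 kHz, neither of which is stated in the text; real wavelet filters leak energy across these edges, which is where side lobes and aliasing enter.

```python
def detail_band_hz(level, sample_rate_hz):
    """Nominal (low, high) band edges in Hz of the detail coefficients at
    `level` (level 1 is the finest scale), assuming ideal half-band filters."""
    high = sample_rate_hz / 2 ** level
    low = sample_rate_hz / 2 ** (level + 1)
    return low, high


SAMPLE_RATE_HZ = 8000  # assumed value, not taken from the thesis

# Band edges for the first four detail levels.
bands = {level: detail_band_hz(level, SAMPLE_RATE_HZ) for level in range(1, 5)}
```

With these assumptions, level 1 spans 2000 to 4000 Hz, level 2 spans 1000 to 2000 Hz, and so on, so the formant region below 3kHz is distributed over several levels.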

5.1.5.3 Effects Of The Wavelet Choice On De-noising In The Fourier Domain With
Noisy Phase Restoration. Since the de-noising process is carried out in the Fourier domain, the
noise level is split between the real and imaginary parts. These are then wavelet transformed and
decomposed into bands of frequencies in order to eliminate most of the noise from each band. By
restoring the noisy phase and applying the wavelet shrinkage to both the real and imaginary parts
of the Fourier transform of the noisy signal, the effect of aliasing seems to decrease, even for db6
and coiflets(6) (see figures J.1 and J.2). However, for the same reasons described in the previous
section, most of the formant structure of the noisy speech signal is preserved when using db20
(see figure J.3).

5.2 Conclusions


In this chapter, we presented the results of several speech de-noising experiments on various
noisy speech data (-10db to 10db). In general, we saw that the speech de-noising
systems using both Fourier and wavelets produced intelligible speech even for low signal-to-noise
ratios (SNR). The use of the noisy phase improved both the quality and the intelligibility of the
de-noised speech signals. The use of the soft thresholding technique (STT) in the wavelet-Fourier
domain proves to be a very good technique to use in the enhancement of noisy speech data.

VI. Conclusions and Recommendations


6.1  Introduction 

In this chapter, we present both the conclusions of this research and some of the recommendations
for future research in the area of enhancing noisy speech data. We summarize the major
points and evaluate how well the objectives of this thesis were met.

6.2  Main  Conclusions  Of  The  Thesis 

This thesis is successful in producing several speech de-noising systems (SDS) in the time,
Fourier, and the wavelet domains. Without the use of wavelets, the SDSs perform relatively
well and produce intelligible speech when the noise level is low (SNR = 6db). These systems are
comparatively fast (since they don’t require the wavelet transform) and can be used to produce
results comparable to those of the wavelet-based SDSs for low levels of noise (e.g., SNR = 6db). However,
when the noise level is high (SNR = 0db), the non-wavelet SDSs do not produce intelligible speech.
In fact, without using wavelets, the application of either the soft thresholding technique (STT) or
the hard thresholding technique (HTT) to noisy speech data with noise levels below SNR = 6db
produced de-noised speech data that is worse to listen to than the noisy speech data itself.

The application of Stein’s criteria to noisy voiced speech using wavelets on the time data
(Donoho’s technique) did not produce intelligible speech at any of the noise levels tested (i.e., -10db to 10db).
In fact, this method produced very distorted de-noised speech with a constant disturbing sound,
which is mainly due to the non-linear effect of the thresholding techniques. The use of the noisy
phase produced a slight improvement in the intelligibility of speech. Finally, the wavelet
shrinkage techniques applied in the Fourier domain with noisy phase restoration prove to be a
powerful technique to enhance speech data degraded by additive white Gaussian noise. In fact,
when using a wavelet with good filtering characteristics (e.g., db20), the formant structure and
intelligibility can be considerably preserved. This new technique involves a lot of calculations
due to the Fourier transforms, the wavelet transforms, and the phase calculations. However, the
intelligibility of the de-noised speech data surpassed that of all the other de-noising systems, especially
when the noise levels were high (SNRs below 6db).

The combination of the noisy phase and the wavelet-Fourier technique produced the best
results (intelligibility) because it involves a de-noising process on two less noisy sets of data: the
real and imaginary parts of the Fourier transform of the noisy signal. The Fourier transform splits
the noise level between the real and imaginary parts. De-noising the wavelet details of both the
real and imaginary parts reduces the noise at each level of decomposition, resulting in a large
amount of noise being taken from both the real and imaginary parts. After this de-noising process,
the combination of the real and imaginary parts produces a cleaner amplitude which is further
combined with the noisy phase, wherein important speech information is saved. Most importantly,
this research illustrated the fact that the phase has the potential to preserve a lot of the underlying
speech formant structure and that, in order to avoid aliasing and still preserve intelligibility, it is
very important to choose a wavelet with very good filtering characteristics.
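The processing chain summarized above (Fourier transform, separate de-noising of the real and imaginary parts, cleaner amplitude recombined with the noisy phase) can be outlined as follows. This is only a structural sketch, not the implementation used in this research: the `shrink` argument is a hypothetical placeholder for the wavelet shrinkage step, a naive DFT is used to keep the example self-contained, and the frame length is arbitrary.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def denoise_with_noisy_phase(noisy, shrink):
    """De-noise the real and imaginary parts separately, then rebuild the
    spectrum from the cleaned amplitude and the phase of the noisy signal.

    `shrink` maps a list of floats to a list of the same length; it stands
    in for the wavelet shrinkage step and is an assumption of this sketch.
    """
    spectrum = dft(noisy)
    noisy_phase = [cmath.phase(value) for value in spectrum]
    real_part = shrink([value.real for value in spectrum])
    imag_part = shrink([value.imag for value in spectrum])
    amplitude = [math.hypot(re, im) for re, im in zip(real_part, imag_part)]
    rebuilt = [cmath.rect(a, p) for a, p in zip(amplitude, noisy_phase)]
    return [value.real for value in inverse_dft(rebuilt)]


# Sanity check with a shrink step that changes nothing.
frame = [math.sin(2.0 * math.pi * 3 * t / 32) for t in range(32)]
round_trip = denoise_with_noisy_phase(frame, shrink=lambda values: list(values))
```

With the identity in place of the shrinkage step, the output reproduces the input frame, which confirms that only the shrinkage step alters the signal in this arrangement.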

6.3  Evaluation  Of  The  Thesis  Objectives 

In terms of the four objectives mentioned in the first chapter, in this thesis we were able
to apply both wavelets and the soft thresholding technique (STT) to enhance noisy speech data.
The speech de-noising systems (SDS) can only be applied to the voiced portions. The unvoiced
and silent portions are not to be processed using the SDSs discussed in the fourth chapter. These
portions tend to disappear when processed by the SDSs, and hence, we can use our SDSs as detector
systems for the unvoiced, voiced, and silent speech portions by using a single window on the entire
speech utterance. The use of the noisy phase, combined with wavelets, the Fourier transform, and the STT,
considerably improved intelligibility. The use of wavelets with thresholding is important;
however, in order to obtain good results, a wavelet with good filtering characteristics
(no high energy side-lobes) must be chosen, because this choice has a direct effect on the quality
of the de-noised speech data.

6.4  Recommendations 

Further investigations in the area of noise cancellation using both Fourier and wavelets can
extend the results of this research. Many of the methods described in this work can be further
explored, improved (e.g., hard or soft thresholding of the approximations where the pitch of voiced
speech resides), and compared to our results. The STT and HTT methods can be used to develop
a pre-processing system to detect the voiced, unvoiced, and silent speech portions. Since this
research assumes that the locations of the voiced, unvoiced, and silent portions are known, an STT or
HTT based detector system can complete our de-noising system.

One of the main concerns of our speech de-noising algorithm is speed. Because
our algorithm uses the Fourier transform, the wavelet transform, and the STT or the HTT
techniques, the results take considerable time to produce (an average of 8 minutes on a
Sparc2 station with a single processor). In order to reduce the algorithm execution time, we suggest
implementing the algorithm on a parallel machine; we also need to derive a better way of
finding the thresholds that minimize the SURE functions of either the HTT or the STT methods.
In fact, the SURE function involves many loops and many comparisons that use each element of the
noisy data. This means that as the number of data points increases, the execution time increases
rapidly.
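As an illustration of how the threshold search might be made cheaper, the sketch below evaluates Stein's unbiased risk estimate for soft thresholding with one sort and one pass. It assumes the standard unit-variance form SURE(t) = n - 2 #{i : |x_i| <= t} + sum_i min(|x_i|, t)^2 with candidate thresholds taken from the magnitudes of the data; whether this matches the SURE function used in this research is an assumption. A direct version, which compares every candidate against every sample, is included for comparison. The function names and test data are illustrative.

```python
import math
import random


def sure_soft_threshold(coefficients):
    """Return (threshold, risk) minimizing SURE for soft thresholding,
    assuming unit-variance Gaussian noise. Uses one sort and one pass."""
    squares = sorted(value * value for value in coefficients)
    n = len(squares)
    best_threshold, best_risk = 0.0, float("inf")
    running_sum = 0.0
    for k, square in enumerate(squares):
        running_sum += square      # squares not larger than this candidate
        count_below = k + 1        # coefficients with |x_i| <= candidate
        risk = n - 2 * count_below + running_sum + (n - count_below) * square
        if risk < best_risk:
            best_threshold, best_risk = math.sqrt(square), risk
    return best_threshold, best_risk


def sure_soft_threshold_direct(coefficients):
    """Same minimization, comparing every candidate against every sample."""
    n = len(coefficients)
    best_threshold, best_risk = 0.0, float("inf")
    for candidate in (abs(value) for value in coefficients):
        kept = sum(1 for value in coefficients if abs(value) <= candidate)
        clipped = sum(min(abs(value), candidate) ** 2 for value in coefficients)
        risk = n - 2 * kept + clipped
        if risk < best_risk:
            best_threshold, best_risk = candidate, risk
    return best_threshold, best_risk


# Synthetic data: unit-variance noise plus a few large coefficients.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data[:5] = [6.0, -7.5, 5.2, 8.1, -6.4]

fast = sure_soft_threshold(data)
direct = sure_soft_threshold_direct(data)
```

Both functions reach the same minimum risk; the direct version performs on the order of n^2 comparisons, while the sorted version needs on the order of n log n operations.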

Finally, the results of this research illustrated the need for a better metric for analyzing
the performance of speech de-noising systems. Most of the speech de-noising systems produced speech
data with low error with respect to the clean speech signal; however, the resulting speech does not
have good intelligibility.

Most importantly, since we are using both wavelets and Fourier transforms, most of the
processing can be implemented using parallel processing to speed up the results. Finally, the SURE
functions should be further studied in order to find an effective criterion for choosing the thresholds
that minimize the SURE functions without checking all the samples available.

Appendix A. Wavelet Coefficients


This Appendix contains both the h filter coefficients and the Fourier transforms for each of the
three wavelets used in this thesis: db6, coiflet(6), and db20. The Fourier transforms show clearly
that the approximation filters, h, are low-pass filters, while the detail filters, g, are high-pass filters.
These filters are used in the discrete wavelet transform (DWT) to divide the frequency spectrum of
the signal under analysis into bands which have a constant bandwidth on a logarithmic frequency
scale (in our case the bandwidths change by a factor of 2, the dilation factor). Observe that the
Fourier transforms (g and h filters) of both db6 and coiflet(6) do not have sharp roll-offs, while
those of db20 are sharper.
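The low-pass and high-pass behavior described above can be checked numerically from the filter taps. The sketch below uses the six-tap h filter as printed in this appendix and derives g with the quadrature-mirror relation g[n] = (-1)^n h[L-1-n]. That sign convention is one common choice and is an assumption here, as are the function names.

```python
import cmath
import math

# Six-tap h filter as printed in this appendix.
H_FILTER = [-.07273261951, .33789766250, .85257202020,
            .38486484700, -.07273296500, -.01565572800]


def quadrature_mirror(h):
    """Detail filter g from approximation filter h: g[n] = (-1)^n h[L-1-n]
    (one common sign convention, assumed here)."""
    length = len(h)
    return [(-1) ** n * h[length - 1 - n] for n in range(length)]


def magnitude_response(taps, omega):
    """Magnitude of sum_n taps[n] exp(-j omega n), omega in rad/sample."""
    return abs(sum(tap * cmath.exp(-1j * omega * n)
                   for n, tap in enumerate(taps)))


G_FILTER = quadrature_mirror(H_FILTER)

h_at_dc = magnitude_response(H_FILTER, 0.0)
h_at_nyquist = magnitude_response(H_FILTER, math.pi)
g_at_dc = magnitude_response(G_FILTER, 0.0)
g_at_nyquist = magnitude_response(G_FILTER, math.pi)
```

The h filter passes zero frequency with a gain close to the square root of 2 and has a near-zero response at the highest frequency, while g does the opposite. Evaluating `magnitude_response` on a grid of frequencies between 0 and pi gives a curve whose roll-off can be compared across filters.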

Table A.1 Scaling function coefficients of db6.

coefficients of the filter h
-.07273261951
.33789766250
.85257202020
.38486484700
-.07273296500
-.01565572800

Table A.2 Scaling function coefficients of coiflet(6).

coefficients of the filter h
.026670057901
.188176800078
.527201188932
.688459039454
.281172343661
-.249846424327
-.195946274377
.127369340336
.093057364604
-.071394147166
-.029457536822
.033212674059
.003606553567
-.010733175483
.001395351747
.001992405295
-.000685856695
-.000116466855
.000093588670
-.000013264203

Table A.3 Scaling function coefficients of db20.

Figure A.1 Fourier transforms of the h and g filters of db6. Plot titles: h filter (approximation); Amplitude Of The FFT Of h.

Wavelet: coiflet_6

Figure A.3 Fourier transforms of the h and g filters of db20. Plot title: Wavelet: db20.

Appendix B. Wavelets And Their Fourier Transform


This Appendix contains the plots, on linear scales, of the three wavelets used in this thesis.
All figures have identical time and frequency axes in arbitrary units. The amplitudes of the Fourier
transforms of all three wavelets represent band-pass filters. Observe that the amplitudes of the
Fourier transforms of db6 and coiflet(6) have many high energy side-lobes, while the amplitude of
the Fourier transform of db20 has very small side-lobes or none at all. These filtering characteristics of
the three wavelets affect the quality of the speech de-noising results (see spectrograms of Appendix
J through L).

Appendix C. Wavelet Shrinkage of Sinewave


This appendix contains the de-noising results of a sinewave of frequency 2Hz. We generated
two signals, each containing 512 samples. The first signal is a 2Hz sinewave and the second is white
Gaussian noise of zero mean and variance σ² = 1. We added the white Gaussian noise to the
clean sinewave and then applied the soft thresholding technique (STT) to both the clean and noisy
sinewaves. The de-noising process was carried out in the wavelet domain using db20. The discrete
wavelet transform (DWT) of the clean 2Hz sinewave shows high energy details at the seventh and
eighth levels, while the DWT of the noisy sinewave shows high energy details at all levels (see
figures C.1 and C.2). The high energy details of the early levels of decomposition (i.e., levels 1, 2,
3, and 4) of the noisy sinewave are mainly due to noise. We applied the STT to both the clean
and noisy sinewaves, separately.

The clean signal was processed using the STT method and a variance value of σ² = 1. Figure
C.3 shows that the details of the clean signal are still preserved and figure C.4 shows the near
perfect reconstruction of the clean sinewave. Observe that the amplitude of the Fourier transform
of the STT processed clean signal is almost identical to that of the original clean sinewave (see
figure C.5). Notice the phase distortions caused by the non-linear processing of this sinewave (see
figure C.6).

The application of the STT to the noisy sinewave has eliminated most of the high frequency
details which are mainly due to noise. Figure C.7 shows that the high energy details of the early
levels of decomposition (i.e., levels 1, 2, 3, and 4) of the noisy sinewave have been completely
eliminated, while the details of the seventh and eighth levels of decomposition, which characterize
the clean sinewave, are still preserved. The reconstructed sinewave (see figure C.8) is very close to
the clean sinewave. Observe the effects of the STT on both the amplitude and the phase of the
Fourier transform of the noisy and reconstructed sinewaves (see figures C.9 and C.10).
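The experiment described in this appendix can be imitated in a few lines. The sketch below is a simplified stand-in and not the code used to produce the figures: it uses the Haar wavelet instead of db20, the fixed threshold sqrt(2 ln N) instead of a SURE-based threshold, and it assumes the 2Hz sinewave completes two cycles over the 512 samples. All function names are illustrative.

```python
import math
import random


def haar_dwt(signal):
    """Full Haar decomposition. Returns (approximation, details), where
    details[0] is level 1 (finest). Length must be a power of two."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return approx, details


def haar_idwt(approx, details):
    """Inverse of haar_dwt."""
    signal = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(signal, detail):
            rebuilt.extend([(a + d) / math.sqrt(2.0), (a - d) / math.sqrt(2.0)])
        signal = rebuilt
    return signal


def soft_threshold(value, threshold):
    """Shrink the magnitude by `threshold`; values below it become zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def total_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


NUM_SAMPLES = 512
random.seed(1)
clean = [math.sin(2.0 * math.pi * 2 * t / NUM_SAMPLES) for t in range(NUM_SAMPLES)]
noisy = [c + random.gauss(0.0, 1.0) for c in clean]  # noise variance 1

approx, details = haar_dwt(noisy)
threshold = math.sqrt(2.0 * math.log(NUM_SAMPLES))  # stand-in for a SURE threshold
shrunk = [[soft_threshold(d, threshold) for d in level] for level in details]
denoised = haar_idwt(approx, shrunk)

error_noisy = total_squared_error(clean, noisy)
error_denoised = total_squared_error(clean, denoised)
```

In this simplified setting the fine-scale details, which are dominated by noise, are set to zero, while the large coarse-scale coefficients carrying the sinewave survive, so the squared error against the clean sinewave drops.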

Plot titles and labels: Wavelet: db20; Sinewave: Frequency = 2 And Variance = 1; Clean Details; Noisy Details; Soft Thresholding Technique (STT); Noisy Signal; Original Signal.

Figure C.2 Details of the noisy sinewave (2Hz).

Figure C.3 Details of the processed clean sinewave (2Hz) after the STT (σ² = 1).

Figure C.4 Clean sinewave (2Hz) after the STT (σ² = 1).

Figure C.6 Phase of the FFT of the clean sinewave (2Hz) after the STT (σ² = 1).

Figure C.8 Noisy sinewave (2Hz) after the STT (σ² = 1).

Figure C.9 Amplitude of the FFT of the noisy sinewave (2Hz) after the STT (σ² = 1).

Figure C.10 Phase of the FFT of the noisy sinewave (2Hz) after the STT (σ² = 1).

Appendix D. Effect Of Wavelet Shrinkage On White Gaussian Noise and Unvoiced Speech

This Appendix contains the plots of a white Gaussian noise signal and an unvoiced speech
signal before and after the application of the soft thresholding technique (STT). The de-noising
method uses wavelets in the time domain without noisy phase restoration (using db20). 512 samples
of a white Gaussian noise signal with zero mean and σ² = 1 were generated. This signal was
processed using the STT. Since the SURE function estimates the mean of an independent and
normally distributed random signal, the expected result is a signal with 512 zeros (i.e., the noise
has zero mean). The white Gaussian plots illustrate the fact that the result of applying the STT to
the white Gaussian noise is very close to zero. Observe that the Fourier transform of the noise has high
energy throughout the entire spectrum. Also, notice that the high-frequency detail
levels (i.e., levels 1, 2, and 3) filter most of the white Gaussian noise.

The second set of plots deals with both clean and noisy unvoiced speech. The plots illustrate
the fact that unvoiced speech is treated as white Gaussian noise. In fact, when using the STT to
de-noise a clean unvoiced speech signal (similar to the case of a clean sinewave), the result is a
signal with zeros everywhere (see figure D.11). Notice that the effects of the STT on a noisy unvoiced
speech signal are similar to the effects of the STT on white Gaussian noise. We conclude, then, that
the noisy unvoiced speech data has characteristics similar to those of white Gaussian noise. In order
to prevent losing all the unvoiced as well as the silent speech portions, we chose not to process
these portions. One important observation is that both the soft and hard thresholding techniques
(STT and HTT) can be used as detectors for voiced, unvoiced, and silent speech segments.
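
The two observations above (white Gaussian noise shrinks to almost nothing, and thresholding can act as a segment detector) can be sketched in a few lines. This is an illustration only, not the thesis implementation: it assumes an orthonormal Haar wavelet in place of db20, the universal threshold σ√(2 ln n) in place of the SURE-based threshold, and a synthetic low-frequency sinusoid as a stand-in for a voiced frame. The names `haar_dwt`, `haar_idwt`, `stt_denoise`, and `surviving_energy` are hypothetical.

```python
import numpy as np

def haar_dwt(x, levels):
    """Orthonormal Haar decomposition; details[0] is the finest level."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

def haar_idwt(approx, details):
    """Inverse of haar_dwt (perfect reconstruction)."""
    for d in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

def soft_threshold(c, t):
    """Shrink coefficients toward zero by t; anything within [-t, t] becomes zero."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def stt_denoise(x, sigma=1.0, levels=6):
    """Soft-threshold the detail coefficients (assumed universal threshold, not SURE)."""
    t = sigma * np.sqrt(2.0 * np.log(len(x)))
    approx, details = haar_dwt(x, levels)
    return haar_idwt(approx, [soft_threshold(d, t) for d in details])

def surviving_energy(x, sigma=1.0, levels=6):
    """Fraction of the frame energy that survives thresholding (a crude detector statistic)."""
    y = stt_denoise(x, sigma, levels)
    return float(np.sum(y ** 2) / np.sum(np.asarray(x, dtype=float) ** 2))

rng = np.random.default_rng(0)
n = 512
noise = rng.normal(0.0, 1.0, n)                       # zero mean, variance 1
voiced_like = 10.0 * np.sin(2.0 * np.pi * 4.0 * np.arange(n) / n) + noise

print("noise frame, surviving energy fraction:      ", surviving_energy(noise))
print("voiced-like frame, surviving energy fraction:", surviving_energy(voiced_like))
```

A frame whose surviving energy fraction is close to zero behaves like noise (unvoiced or silent), while a frame that keeps most of its energy behaves like voiced speech. Any decision threshold placed between the two is a design choice that this sketch does not attempt to justify.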

Figure D.1 Details of the white Gaussian noise (σ² = 1).

Figure D.2 Details of the processed white Gaussian noise after the STT (σ² = 1).

Figure D.3 White Gaussian noise after the STT (σ² = 1).

Figure D.4 Amplitude of the FFT of the white Gaussian noise after the STT (σ² = 1).

Figure D.5 Phase of the FFT of the white Gaussian noise after the STT (σ² = 1).

Figure D.6 Details of the clean unvoiced speech.

Figure D.7 Details of the noisy unvoiced speech.

Figure D.8 Details of the processed clean unvoiced speech after the STT (σ² = 1).

Figure D.9 Details of the processed noisy unvoiced speech after the STT (σ² = 1).

Figure D.10 Noisy unvoiced speech after the STT (σ² = 1).

Figure D.11 Clean unvoiced speech after the STT (σ² = 1).

Figure D.13 Amplitude of the FFT of the clean unvoiced speech after the STT (σ² = 1).

Figure D.14 Phase of the FFT of the noisy unvoiced speech after the STT (σ² = 1).

Figure D.15 Phase of the FFT of the clean unvoiced speech after the STT (σ² = 1).

Appendix E. Effect Of Wavelet Shrinkage On Voiced Speech


This Appendix illustrates the effects of applying the STT to both a noisy and a clean voiced
speech segment. The wavelet decomposition of the clean voiced speech segment shows high energy
details at the coarser levels of decomposition (i.e., levels 4, 5, and 6), while the finer levels of
decomposition (i.e., levels 1 and 2) have little or no high energy details at all (see figure E.1). On
the other hand, the noisy version of this voiced speech signal (clean voiced speech plus noise with
variance σ² = 1) shows high energy at all detail levels, especially levels 1 and 2 (see figure E.2).
The effect of the STT on both the clean and noisy voiced speech signals is that most of the high
frequency details are eliminated (see figures E.3 and E.4). Observe that the reconstructions of both
signals, the STT processed clean voiced speech signal and the STT processed noisy voiced speech
signal, are very close to the original voiced speech signal. The amplitudes of the Fourier transforms
of the reconstructed signals show very few high frequency components. Finally, notice the effects
of the non-linear processing on the phase (see figures E.9 and E.10).
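
The level-by-level picture described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: it uses a Haar wavelet instead of db20, and a low-frequency sinusoid as a stand-in for a clean voiced segment. The function name `haar_detail_energies` is hypothetical. It prints the detail energy at each level for the clean and the noisy signal, so the extra energy that the noise adds at the finest levels (1 and 2) is visible.

```python
import numpy as np

def haar_detail_energies(x, levels):
    """Energy of the Haar detail coefficients, from level 1 (finest) to `levels` (coarsest)."""
    approx = np.asarray(x, dtype=float)
    energies = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)
        approx = (even + odd) / np.sqrt(2.0)
        energies.append(float(np.sum(detail ** 2)))
    return energies

rng = np.random.default_rng(1)
n = 512
clean = 10.0 * np.sin(2.0 * np.pi * 4.0 * np.arange(n) / n)  # synthetic voiced-like segment
noisy = clean + rng.normal(0.0, 1.0, n)                       # added noise with variance 1

clean_e = haar_detail_energies(clean, 6)
noisy_e = haar_detail_energies(noisy, 6)
for level, (c, z) in enumerate(zip(clean_e, noisy_e), start=1):
    print(f"level {level}: clean detail energy = {c:10.2f}   noisy detail energy = {z:10.2f}")
```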

Figure E.3 Details of the processed clean voiced speech after the STT (σ² = 1).

Figure E.5 Noisy voiced speech after the STT (σ² = 1).

Figure E.6 Clean voiced speech after the STT (σ² = 1).

Figure E.7 Amplitude of the FFT of the noisy voiced speech after the STT (σ² = 1).

Figure E.9 Phase of the FFT of the noisy voiced speech after the STT (σ² = 1).

Figure E.10 Phase of the FFT of the clean voiced speech after the STT (σ² = 1).

Appendix F. Total Squared Error With Respect To Both The Clean And Noisy
Speech Signals Using Compactly Supported Wavelets

This Appendix contains bar-charts showing the total squared error (TSE) between the de-noised
speech signals and both the clean and noisy speech signals, using db6, coiflet(6), and db20.
We studied the effects of both the soft and hard thresholding techniques (STT and HTT) on seven
different noisy signals with signal-to-noise ratios (SNR) of -10db, -6db, -3db, 0db, 3db, 6db, and
10db. Eight different speech de-noising systems (SDS) have been studied:

1.  WRINP  means  that  the  SDS  uses  the  wavelet  transform  on  the  real  and  imaginary  parts 
of  the  Fourier  transform  of  the  original  noisy  signal.  This  method  reconstructs  the  signal  using  the 
phase of the original noisy speech signal.

2.  WRI  means  that  the  SDS  uses  the  wavelet  transform  on  the  real  and  imaginary  parts 
of  the  Fourier  transform  of  the  original  noisy  signal.  This  method  does  not  use  the  phase  of  the 
original  noisy  speech  signal. 

3.  WTNP  means  that  the  SDS  uses  the  wavelet  transform  of  the  original  noisy  signal  (no 
Fourier  transform).  This  method  reconstructs  the  signal  using  the  phase  of  the  original  noisy  speech 
signal. 

4.  WT  means  that  the  SDS  uses  the  wavelet  transform  of  the  original  noisy  signal  (no  Fourier 
transform).  This  method  does  not  use  the  phase  of  the  original  noisy  speech  signal.  This  method 
is  based  on  Donoho’s  original  work  on  wavelet  shrinkage. 

5.  SRINP  means  that  the  SDS  uses  Stein’s  criteria  directly  on  the  real  and  imaginary  parts 
of  the  Fourier  transform  of  the  original  noisy  signal.  This  method  reconstructs  the  signal  using  the 
phase  of  the  original  noisy  speech  signal.  This  method  resembles  the  spectral  subtraction  developed 
by  Steven  Boll. 

6.  SRI  means  that  the  SDS  uses  Stein’s  criteria  directly  on  the  real  and  imaginary  parts 
of  the  Fourier  transform  of  the  original  noisy  signal.  This  method  does  not  use  the  phase  of  the 
original noisy speech signal.

7.  STNP  means  that  the  SDS  uses  Stein’s  criteria  directly  on  the  original  noisy  signal  (no 
Fourier  transform).  This  method  reconstructs  the  signal  using  the  phase  of  the  original  noisy  speech 
signal. 

8.  ST  means  that  the  SDS  uses  Stein’s  criteria  directly  on  the  original  noisy  signal  (no 
Fourier  transform).  This  method  does  not  use  the  phase  of  the  original  noisy  speech  signal. 
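
The two ideas that distinguish these systems, thresholding the real and imaginary parts of the Fourier transform ("RI") and reusing the noisy phase ("NP"), can be sketched as follows. This is one plausible reading rather than the thesis implementation: it assumes that noisy phase restoration means keeping the magnitude of the de-noised spectrum and taking the phase from the spectrum of the noisy signal. The threshold value is arbitrary (not derived from Stein's criteria), and the names `threshold_real_imag` and `with_noisy_phase` are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink values toward zero by t; anything within [-t, t] becomes zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def threshold_real_imag(noisy, t):
    """Soft-threshold the real and imaginary parts of the spectrum separately (the 'RI' idea)."""
    spectrum = np.fft.rfft(noisy)
    return soft_threshold(spectrum.real, t) + 1j * soft_threshold(spectrum.imag, t)

def with_noisy_phase(denoised_spectrum, noisy):
    """Keep the de-noised magnitude, take the phase from the noisy signal (the assumed 'NP' step)."""
    noisy_phase = np.angle(np.fft.rfft(noisy))
    return np.abs(denoised_spectrum) * np.exp(1j * noisy_phase)

rng = np.random.default_rng(0)
n = 512
clean = 10.0 * np.sin(2.0 * np.pi * 8.0 * np.arange(n) / n)   # synthetic stand-in for speech
noisy = clean + rng.normal(0.0, 1.0, n)

t = 60.0  # arbitrary illustrative threshold, not a SURE-derived value
shrunk = threshold_real_imag(noisy, t)
sri_like = np.fft.irfft(shrunk, n=n)                             # no phase restoration
srinp_like = np.fft.irfft(with_noisy_phase(shrunk, noisy), n=n)  # noisy phase restored

for name, y in (("noisy", noisy), ("SRI-like", sri_like), ("SRINP-like", srinp_like)):
    print(f"{name:11s} squared error vs clean: {np.sum((y - clean) ** 2):.2f}")
```

For the wavelet-based variants (WRI and WRINP), the thresholding step would act on the wavelet coefficients of the real and imaginary parts instead of on those parts directly. The phase step would be unchanged under the same assumption.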

The bar-charts (SNRs 0db and 6db) of the TSE with respect to the noisy signal (see figures
F.1 and F.3) show the closeness between the de-noised signals and the noisy signal. Ideally, we
would like the processed speech signals to be as far away as possible from the noisy signal, indicated
by a large TSE. All the bar-charts illustrate the fact that the STT outperforms the HTT, since all
the STT bars have higher TSEs than the HTT bars.

The bar-charts (SNRs 0db and 6db) of the TSE with respect to the clean signal (see figures
F.2 and F.4) show the closeness between the de-noised signals and the clean signal. Ideally, we
would like the processed speech signals to be very close to the clean signal, indicated by a small TSE.
All the bar-charts illustrate the fact that the STT outperforms the HTT, since all the STT bars
have lower TSEs than the HTT bars.
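
The two TSE measures discussed above can be computed as in the following sketch. It is not the thesis code, and several choices are assumptions made only for illustration: the SNR is defined as the ratio of signal energy to noise energy, the "speech" is a synthetic sinusoid, the thresholds are applied directly to the time samples (loosely in the spirit of the ST system), and the threshold is set to the noise standard deviation rather than derived from Stein's criteria. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """STT rule: shrink every value toward zero by t."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def hard_threshold(x, t):
    """HTT rule: keep values whose magnitude exceeds t, zero the rest."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > t, x, 0.0)

def total_squared_error(a, b):
    """TSE: sum of squared sample-wise differences between two signals."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum((a - b) ** 2))

def add_noise_at_snr(clean, snr_db, rng):
    """Add white Gaussian noise scaled so that the realized SNR equals snr_db."""
    noise = rng.normal(0.0, 1.0, clean.shape)
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
n = 512
clean = np.sin(2.0 * np.pi * 4.0 * np.arange(n) / n)   # synthetic stand-in for speech

results = {}
for snr_db in (0.0, 6.0):
    noisy = add_noise_at_snr(clean, snr_db, rng)
    noise_std = np.std(noisy - clean)   # known here only because the data is synthetic
    for name, rule in (("STT", soft_threshold), ("HTT", hard_threshold)):
        denoised = rule(noisy, noise_std)
        tse_noisy = total_squared_error(denoised, noisy)
        tse_clean = total_squared_error(denoised, clean)
        results[(snr_db, name)] = (tse_noisy, tse_clean)
        print(f"SNR {snr_db:4.1f}db {name}: TSE vs noisy = {tse_noisy:8.2f}, TSE vs clean = {tse_clean:8.2f}")
```

When both rules use the same threshold on the same coefficients, soft thresholding moves every coefficient at least as far as hard thresholding does, so its TSE with respect to the noisy signal is never smaller. The TSE with respect to the clean signal carries no such guarantee and depends on the data and on the threshold.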

Figure F.1 TSE using the noisy speech and the de-noised speech (0db) with wavelets: db6, coiflets,
and db20.
[Bar chart of total error, SNR = 0db (thresholded vs noisy), for the utterance "They enjoy it when I audition".]

Figure F.2 TSE using the clean speech and the de-noised speech (0db) with wavelets: db6, coiflets,
and db20.
[Bar chart of total error, SNR = 0db (thresholded vs clean), for the utterance "They enjoy it when I audition".]

Figure F.3 TSE using the noisy speech and the de-noised speech (6db) with wavelets: db6, coiflets,
and db20.
[Bar chart of total error, SNR = 6db (thresholded vs noisy), for the utterance "They enjoy it when I audition".]

Figure F.4 TSE using the clean speech and the de-noised speech (6db) with wavelets: db6, coiflets,
and db20.
[Bar chart of total error, SNR = 6db (thresholded vs clean), for the utterance "They enjoy it when I audition".]


Appendix  G.  Spectrum  Analysis  Of  The  Clean  And  Noisy  Speech  Signals 

This appendix contains both the wide-band and narrow-band spectrograms of the clean speech
signal, the 6db noisy speech signal, and the 0db noisy speech signal. Observe the high energy of
the first formant frequency (below 500Hz) and of the second and third formant frequencies (below 3kHz).
In all figures, the vertical axis represents frequency and the horizontal axis represents samples of
the signal (sampling frequency is 16kHz).
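
The difference between a wide-band and a narrow-band spectrogram comes down to the analysis window length. The sketch below is a minimal illustration, not the procedure used to produce the figures: the window lengths (64 and 512 samples), the Hamming window, the 50% overlap, and the two-tone test signal are all assumptions, and the function name `spectrogram` is hypothetical. As in the figures, the time axis is expressed in samples.

```python
import numpy as np

def spectrogram(x, win_len, hop):
    """Magnitude STFT: rows are frequency bins, columns are frames.

    Also returns the starting sample index of each frame (the horizontal axis).
    """
    x = np.asarray(x, dtype=float)
    window = np.hamming(win_len)
    starts = np.arange(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return magnitude, starts

fs = 16000                                   # sampling frequency stated in the text
t = np.arange(fs) / fs                       # one second of signal
x = np.sin(2.0 * np.pi * 1000.0 * t) + np.sin(2.0 * np.pi * 1100.0 * t)

# Short window: good time resolution, coarse frequency resolution (wide-band).
wide, wide_starts = spectrogram(x, win_len=64, hop=32)
# Long window: fine frequency resolution, coarse time resolution (narrow-band).
narrow, narrow_starts = spectrogram(x, win_len=512, hop=256)

print("wide-band:   shape", wide.shape, " bin spacing (Hz):", fs / 64)
print("narrow-band: shape", narrow.shape, " bin spacing (Hz):", fs / 512)
```

With these assumed window lengths the bin spacing is 250 Hz for the wide-band case and 31.25 Hz for the narrow-band case, so only the narrow-band analysis can separate the two tones spaced 100 Hz apart.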

Figure G.2 Noisy speech wide-band and narrow-band spectrums (6db).

Figure G.3 Noisy speech wide-band and narrow-band spectrums (0db).

Appendix H. Spectrum Analysis Of The De-noised Speech Signals (0db and 6db)

Without  Using  Wavelets 

This appendix contains the wide-band spectrograms of two de-noised speech signals (0db
and 6db). The speech signals were processed using the soft thresholding technique (STT) and
Stein’s criteria. Figure H.1 shows the noisy speech data processed in the time domain using Stein’s
criteria. Observe that when the signal-to-noise ratio (SNR) is 6db, all the formant frequencies
are still preserved; however, the third formant frequency of the 0db processed speech signal was
affected by the non-linear shrinkage. Figure H.2 shows the effects of applying the STT to the real
and imaginary parts of the Fourier transform of the original signal. The original noisy phase was
used before reconstruction of the de-noised speech signal. Observe that as the noise level increases
(i.e., from 6db to 0db), the formants are affected. In all figures, the vertical axis represents frequency
and the horizontal axis represents samples of the signal (sampling frequency is 16kHz).

Figure H.1 De-noised speech using (ST) wide-band spectrum (0db and 6db).

Figure H.2 De-noised speech using (SRINP) wide-band spectrums (0db and 6db).

Appendix I. Spectrum Analysis Of The De-noised Speech Signals (0db and 6db)

Using  Wavelets  In  Time 


This appendix contains the wide-band spectrograms of two de-noised speech signals (0db and
6db). The speech signals were processed using the soft thresholding technique (STT), with Stein’s
criteria applied to the wavelet transform. Observe the aliasing produced by db6 and coiflet(6).
This aliasing is mainly due to the fact that these wavelets have Fourier transforms with many
high energy side-lobes. All wavelets used (db6, coiflet(6), and db20) preserve most of the formant
frequencies. Notice the performance of db20: no aliasing and very clear formant frequencies.
In all figures, the vertical axis represents frequency and the horizontal axis represents samples of
the signal (sampling frequency is 16kHz).

Figure I.1 De-noised speech using (WT and wavelet db6) wide-band spectrums (0db and 6db).

Figure I.2 De-noised speech using (WT and wavelet coiflet(6)) wide-band spectrums (0db and
6db).

Appendix J. Spectrum Analysis Of The De-noised Speech Signals (0db and 6db)

Using  Wavelets  In  Fourier 


This appendix contains the wide-band spectrograms of two de-noised speech signals (0db and
6db). The speech signals were processed using the soft thresholding technique (STT), with Stein’s
criteria applied to the wavelet transforms of both the real and imaginary parts of the Fourier
transform of the original noisy signals. The original noisy phase was used before reconstruction
of the de-noised signal. Observe that most of the formant frequencies are still preserved for all
wavelets used and that there is no noticeable aliasing caused by any wavelet. In all figures, the
vertical axis represents frequency and the horizontal axis represents samples of the signal (sampling
frequency is 16kHz).